Abstract

We introduce a new model of membership query (MQ) learning, where the learning algorithm 
is restricted to query points that are close to random examples drawn from the underlying 
distribution. The learning model is intermediate between the PAC model (Valiant, 1984) and 
the PAC+MQ model (where the queries are allowed to be arbitrary points). 

Membership query algorithms are not popular among machine learning practitioners. Apart 
from the obvious difficulty of adaptively querying labellers, it has also been observed that 
querying unnatural points leads to increased noise from human labellers (Lang and Baum, 
1992). This motivates our study of learning algorithms that make queries that are close to 
examples generated from the data distribution. 

We restrict our attention to functions defined on the n-dimensional Boolean hypercube and
say that a membership query is local if its Hamming distance from some example in the (random)
training data is at most $O(\log(n))$. We show three positive learning results in this model:

(i) The class of $O(\log(n))$-depth decision trees is learnable under a large class of smooth
distributions using $O(\log(n))$-local queries.

(ii) The class of polynomial-sized decision trees is learnable under product distributions using
$O(\log(n))$-local queries.

(iii) The class of sparse polynomials (with coefficients in $\mathbb{R}$) over $\{0,1\}^n$ is learnable under
smooth distributions using $O(\log(n))$-local queries.

1 Introduction



Valiant's Probably Approximately Correct (PAC) model [Val84] has been used widely to study the
computational complexity of learning. In the PAC model, the goal is to design algorithms which can
"learn" an unknown target function, $f$, from a concept class, $C$ (for example, $C$ may be
polynomial-size decision trees or linear separators), where $f$ is a boolean function over some instance space,
$X$ (typically $X = \{-1,1\}^n$ or $X \subseteq \mathbb{R}^n$). The learning algorithm has access to random labeled
examples, $(x, f(x))$, through an oracle, $\mathrm{EX}(f, D)$, where $f$ is the unknown target concept and $D$ is
the target distribution. The goal of the learning algorithm is to output a hypothesis, $h$, with low
error with respect to the target concept, $f$, under distribution, $D$.

Several interesting concept classes have been shown to be learnable in the PAC framework
(e.g. boolean conjunctions and disjunctions, $k$-CNF and $k$-DNF formulas (for constant $k$),
decision lists and the class of linear separators). On the other hand, it is known that very rich
concept classes such as polynomial-sized circuits are not PAC-learnable under cryptographic
assumptions [Val84, GGM86]. The most interesting classes for which both efficient PAC learning
algorithms and cryptographic lower bounds have remained elusive are polynomial-size decision
trees (even log-depth decision trees) and polynomial-size DNF formulas.

Membership Query Model: This learning setting is an extension of the PAC model and allows 
the learning algorithm to query the label of any point x of its choice in the domain. These queries are 
called membership queries. With this additional power it has been shown that the classes of finite 
automata [Ang87], monotone DNF formulas [Ang88], polynomial-size decision trees [Bsh93], and 
sparse polynomials [SS96] are learnable in polynomial time. In a celebrated result, Jackson [Jac94] 
showed that the class of DNF formulas is learnable in the PAC+MQ model under the uniform 
distribution. Jackson [Jac94] used Fourier analytic techniques to prove this result building upon 
previous work of Kushilevitz and Mansour [KM91] on learning decision trees using membership 
queries under the uniform distribution. 

Our Model: Despite several interesting theoretical results, the membership query model has not 
been received enthusiastically by machine learning practitioners. Of course, there is the obvious 
difficulty of getting labellers to perform their task while the learning algorithm is being executed. 
But another, and probably more significant, reason for this disparity is that quite often, the queries 
made by these algorithms are for labels of points that do not look like typical points sampled from 
the underlying distribution. This was observed by Lang and Baum [LB92], where experiments 
on handwritten characters and digits revealed that the query points generated by the algorithms 
often had no structure and looked meaningless to the human eye. This can cause problems for the 
learning algorithm as it may receive noisy labels for such query points. 

Motivated by the above observations, we propose a model of membership queries where the
learning algorithm is restricted to query labels of points that "look" like points drawn from the
distribution. In this paper, we focus our attention on the case when the instance space is the
boolean cube, i.e. $X = \{-1,1\}^n$, or $X = \{0,1\}^n$. However, similar models could be defined in
the case when $X$ is some subset of $\mathbb{R}^n$. Suppose $x$ is a natural example, i.e. one that was received
as part of the training dataset (through the oracle $\mathrm{EX}(f, D)$). We restrict the learning algorithm
to make queries $x'$, where $x$ and $x'$ are close in Hamming distance. More precisely, we say that a
membership query $x'$ is $r$-local with respect to a point $x$, if the Hamming distance, $|x - x'|_H$, is at
most $r$.

One can imagine settings where these queries could be realistic, yet powerful. Suppose you 
want to learn a hypothesis that predicts a particular medical diagnosis using patient records. It 
could be helpful if the learning algorithm could generate a new medical record and query its label.
However, if the learning algorithm is entirely unconstrained, it might come up with a record that
looks like gibberish to any doctor. On the other hand, if the query chosen by the learning algorithm
is obtained by changing an existing record in a few locations (local query), it is more likely that a 
doctor may be able to make sense of such a record. In fact, this might be a powerful way for the 
learning algorithm to identify the most important features of the record. 

It is interesting to study what power these local membership queries add to the learning setting.
At the two extremes are the PAC model (with 0-local queries) and the MQ model (with $n$-local queries).
However, it can be easily observed that using only 1-local queries, the class of $k$-juntas can be learned
in time $\mathrm{poly}(n, 2^k, 1/\epsilon)$, and the class of parities can be learned in polynomial time even in the
presence of random classification noise. These two problems are known to be notoriously difficult in the
PAC learning setting (see [MOS03, BKW03]). In this paper, we show that $O(\log(n))$-local queries
suffice for learning (certain classes of) decision trees under a large class of smooth distributions
(see below for more details). Also, it is easy to show that in a formal sense, allowing a learner to
make 1-local queries gives it strictly more power than in the PAC setting. In fact, essentially the
same argument can be used to show that $(r+1)$-local queries are more powerful than $r$-local queries.
However, $o(n)$-local queries are weaker than the full MQ model. These separation results can be
easily proved under standard cryptographic assumptions, and are presented in Section 5.
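To make the power of 1-local queries concrete, the following is a minimal sketch for the noise-free parity case only (the noisy case mentioned above is not handled here). It assumes the domain $\{-1,1\}^n$; the function names `learn_parity_with_1_local_queries` and `parity`, and the hidden target, are illustrative assumptions and do not come from the paper. Flipping bit $i$ of a natural example changes the label of a parity $\chi_S$ exactly when $i \in S$.

```python
import random


def learn_parity_with_1_local_queries(x, label, mq):
    """Recover the index set S of a noise-free parity chi_S from one example.

    x     : tuple of +1/-1 values (a natural example)
    label : f(x), as returned by the example oracle
    mq    : membership oracle; only queried at Hamming distance 1 from x
    """
    relevant = set()
    for i in range(len(x)):
        # Flip only bit i, so the query is 1-local with respect to x.
        neighbour = x[:i] + (-x[i],) + x[i + 1:]
        if mq(neighbour) != label:
            relevant.add(i)  # the label changed, so bit i belongs to S
    return relevant


def parity(S):
    """Return chi_S as a function on tuples over {-1, +1}."""
    def chi(x):
        out = 1
        for i in S:
            out *= x[i]
        return out
    return chi


rng = random.Random(0)
n = 8
target = parity({1, 4, 6})  # hypothetical hidden target
x = tuple(rng.choice((-1, 1)) for _ in range(n))
recovered = learn_parity_with_1_local_queries(x, target(x), target)
print(sorted(recovered))
```

The sketch uses a single natural example and $n$ queries, each at Hamming distance exactly 1 from it.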

We would like to contrast our model with the popular active learning framework [SOS92]. An 
active learning algorithm can choose unlabeled points sampled from the distribution on which 
to query the oracle. Although they lead to reduced (labelled) sample complexity, such learning 
algorithms tend to be computationally inefficient. Additionally, the active learning model is weaker
than PAC-learning, and hence, we cannot hope to actively and efficiently learn richer classes like
decision trees without major breakthroughs in PAC learning results.

Our Results: We consider smooth distributions over the boolean cube, which we denote by
$\{-1,1\}^n$ (or sometimes by $\{0,1\}^n$). We say that a distribution, $D$, over the boolean cube,
$X = \{b_0, b_1\}^n$, is $\alpha$-smooth if for any two points $x$ and $x'$ which differ in only one bit, $D(x)/D(x') \leq \alpha$.
Intuitively, this captures distributions which don't change very much when a small amount of noise
is added to the example and so are in some sense smooth. Note that the uniform distribution is
smooth with $\alpha = 1$. Also, product distributions are smooth, when the mean of each bit is some
constant bounded away from $\pm 1$ (or 0, 1).

We will be interested in the class of $\alpha$-smooth distributions for constant $\alpha$; under these
distributions changing $O(\log(n))$ bits can change the weight of a point by at most a polynomial factor.
Smooth distributions share some crucial properties with the uniform distribution, which we exploit.
One such property is that the probability mass when $d$ variables are fixed is in the range $[c_1^d, c_2^d]$ for
some constants $c_1 < c_2$. On the other hand, these distributions could be very far from the uniform
distribution. One way to design a class of $\alpha$-smooth distributions is the following: Start with an
arbitrary (not necessarily smooth) distribution, $D$, and then add some independent random noise
to each bit.
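The noise construction can be checked by brute force on a small cube. The sketch below is illustrative only: it assumes one particular noise model (each bit is flipped independently with probability `rho`), and the helper names `smoothness` and `add_bit_noise` as well as the starting distribution are hypothetical. Under this assumed noise model, each term of the noisy mass changes by a factor between $\rho/(1-\rho)$ and $(1-\rho)/\rho$ when one bit of the point is flipped, so the result is $\alpha$-smooth with $\alpha = (1-\rho)/\rho$.

```python
import itertools


def smoothness(dist):
    """Smallest alpha for which dist is alpha-smooth (inf if a neighbour has zero mass)."""
    alpha = 1.0
    for x, px in dist.items():
        for i in range(len(x)):
            y = x[:i] + (1 - x[i],) + x[i + 1:]  # flip bit i
            py = dist[y]
            if px > 0 and py == 0:
                return float("inf")
            if py > 0:
                alpha = max(alpha, px / py)
    return alpha


def add_bit_noise(dist, rho):
    """Flip each bit independently with probability rho (assumed noise model)."""
    noisy = {x: 0.0 for x in dist}
    for x, px in dist.items():
        for y in dist:
            flips = sum(a != b for a, b in zip(x, y))
            noisy[y] += px * rho ** flips * (1 - rho) ** (len(x) - flips)
    return noisy


n = 4
cube = list(itertools.product((0, 1), repeat=n))

# An arbitrary, non-smooth starting distribution: all mass on two points.
base = {x: 0.0 for x in cube}
base[(0, 0, 0, 0)] = 0.7
base[(1, 1, 0, 1)] = 0.3

rho = 0.1
noisy = add_bit_noise(base, rho)
print(smoothness(base))                     # not smooth for any finite alpha
print(smoothness(noisy), (1 - rho) / rho)   # observed alpha and the bound
```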

We present the informal statements of our main results here. 

Theorem 1.1. The class of $O(\log(n))$-depth decision trees is efficiently learnable under the class of
$\alpha$-smooth distributions, for any constant $\alpha$, by a learning algorithm that only uses $O(\log(n))$-local
membership queries.

Theorem 1.2. Let $\mathcal{D}$ be the class of product distributions over $X = \{-1,1\}^n$, such that the mean
of each bit is bounded away from $-1$ and $1$ by a constant. Then, the class of polynomial-size
decision trees is learnable with respect to the class of distributions $\mathcal{D}$, by an algorithm that uses
only $O(\log(n))$-local membership queries.

For the above two results, we also show that these algorithms can be implemented when labels
are corrupted by random classification noise. Since the model allows membership queries, we
consider the setting where the noise is persistent, i.e. the first time a label for an example is queried
it may be flipped randomly with probability $\eta$, but if queried again the same label is provided.¹
These results are described in Section 3.4.

Theorem 1.3. The class of $t$-sparse² polynomials (with real coefficients) over $\{0,1\}^n$ is efficiently
learnable under the class of $\alpha$-smooth distributions, for any constant $\alpha$, by a learning algorithm
that only uses $O(\log(n) + \log(t))$-local membership queries.

The class of sparse polynomials contains log-depth decision trees, and hence Theorem 1.3
subsumes Theorem 1.1 (we present Theorem 1.1 separately because it is conceptually simpler).
Richer concept classes are also included in the class of sparse polynomials. This includes the class
of log-depth decision trees, where each node is a monomial (rather than a variable). A special
case of such decision trees is $O(\log(n))$-term DNF expressions. When the polynomials represent
boolean functions, our algorithm in Section 4 can easily be made to work in the presence of random
classification noise, along the lines described in Section 3.4.

Techniques: All our results are based on learning polynomials. It is well known that log-depth
decision trees can be expressed as sparse polynomials of degree $O(\log(n))$. We point out that when
considering $O(\log(n))$-degree polynomials, whether a boolean variable is considered as taking a
value in $\{-1,1\}$ or $\{0,1\}$ does not make a difference (up to polynomial factors). However, if the
constraint on degree is removed, the class of functions that can be defined as sparse polynomials over
$\{0,1\}^n$ is different from the class of functions that can be represented as sparse polynomials over
$\{-1,1\}^n$.

Our results on learning log-depth decision trees (Section 3.1) and learning sparse polynomials
(Section 4) under smooth distributions rely on being able to identify all the important monomials
(those with non-zero coefficient) of low degree, using $O(\log(n))$-local queries. We identify a set
of monomials, the size of which is bounded by a polynomial in the required parameters, which
includes all the important monomials. The learning problem can then be solved easily, for example,
by regression. The crucial idea is that using $O(\log(n))$-local queries, we can identify, given a subset
of variables $S \subseteq [n]$, whether the function on the remaining variables (those in $[n] \setminus S$) is zero or
not.

When we further restrict attention to uniform (or product) distributions (Section 3.2), we
can make use of Fourier techniques. We exploit the fact that the Fourier mass of decision trees
is concentrated on $O(\log(n))$-degree terms. We show that using only $O(\log(n))$-local queries, it
is possible to identify all terms that contribute significantly to the Fourier mass. In both the
smooth distribution and uniform (product) distribution settings, our algorithms build up monomials
starting from the empty term, adding one variable at a time. Thus, we avoid having to search over
all possible $n^{O(\log(n))}$ terms of degree $O(\log(n))$.

One point to note is that under $\alpha$-smooth distributions (which include uniform and product distributions)
for a constant $\alpha$, the main difficulty is designing polynomial time algorithms. Quasi-polynomial
algorithms are trivial and in fact do not even require local membership queries for learning the class
of decision trees and even DNF formulas. This follows from the observation that agnostic learning
of $O(\log(n))$-size parities is easy in quasi-polynomial time.

Related work: The problem of noise in membership queries has been studied before. The work of
Blum et al. [BCGS98] proposed a noisy model wherein membership queries made on points lying
in the low probability region of the distribution are unreliable. For this model the authors design
algorithms for learning an intersection of two halfspaces in $\mathbb{R}^n$ and also for learning a very special
subclass of monotone DNF formulas. Our result on learning sparse polynomials can be compared
with that of Schapire and Sellie [SS96], who provided an algorithm to learn sparse polynomials
under arbitrary distributions in Angluin's exact learning model. However, their algorithm is required
to make membership queries that are not local. Bshouty [Bsh93] gave an algorithm for learning
decision trees using membership queries. In both these cases, it seems unlikely that the algorithms
can be modified to use only local membership queries, even for the class of smooth distributions.
Kushilevitz and Mansour [KM91] gave an algorithm for learning decision trees under the uniform
distribution using membership queries. Their algorithm guarantees something stronger, viz.
agnostic learning of parities. While our decision tree learning algorithm for the uniform distribution
uses ideas from their work, we are unable to prove the stronger result of agnostic learning of (even
$O(\log(n))$-sized) parities using local membership queries.

¹ Otherwise, the true label can be easily obtained by repeatedly querying and taking the majority label.
² Sparsity refers to the number of non-zero coefficients.

There has been considerable work investigating learnability beyond the PAC framework. We
view our results as part of this body of work. Many of these models are motivated by theoretical as
well as real-world interest. On the one hand, it is interesting to study the minimum extra power
one needs to add to the PAC setting, to make the class of polynomial-size decision trees or DNF
formulas efficiently learnable. The work of Bshouty et al. [BMOS05] studies a passive model where
examples are generated by a random walk on $\{-1,1\}^n$. They design algorithms for learning DNF
formulas in this model under the uniform distribution. One could simulate random walks of length
up to $O(\log(n))$ using local membership queries, but we are unable to extend their DNF learning
algorithm to our model. The work of Kalai et al. [KST09] provided polynomial time algorithms for
learning decision trees and DNF formulas in a framework where the learner gets to see examples
from a smoothed distribution.³ Their model was inspired by the celebrated smoothed analysis
framework [ST04]. On the other hand, other models have been proposed to capture plausible
settings when the learner may indeed have more power than in the PAC setting. These situations
arise for example in scientific studies where the learner may have more than just black-box access
to the function. Two recent examples in this line of work are learning using injection queries
of Angluin et al. [AACY06], and learning using restriction access of Dvir et al. [DRWY12]. While
our model is very much a black-box model, with the availability of crowdsourcing techniques and
increased potential of on-line labellers, the model we consider may very well prove to be increasingly
useful.

Organization: Section 2 introduces notation, preliminaries and also formal definitions of the 
model we introduce in this paper. Section 3 presents our two results on learning decision trees, and 
the implementation of these algorithms in the presence of random classification noise. Section 4 
contains the algorithm for learning sparse multi-linear polynomials. Section 5 shows that the 
model we introduce is strictly more powerful than the PAC setting, and strictly weaker than the 
MQ setting. Finally, Section 6 discusses directions for future work. 

2 Notation and Preliminaries 

Notation: Let $X$ be an instance space. In this paper, $X$ is the boolean hypercube. In Section 3,
we will use $X = \{-1,1\}^n$, as we apply Fourier techniques. In Section 4, we will use $X = \{0,1\}^n$
(the class of sparse polynomials over $\{0,1\}^n$ is different from sparse polynomials over $\{-1,1\}^n$). A
concept class, $C$, is a set of functions $X \to Y$ (where $Y = \{-1,1\}$ or $Y = \mathbb{R}$). For a distribution,
$D$, over $X$ and any hypothesis, $h : X \to \{-1,1\}$, we define $\mathrm{err}_D(h, f) = \Pr_{x \sim D}[h(x) \neq f(x)]$. If
$h : X \to \mathbb{R}$, we use squared loss as the error measure, i.e. $\mathbb{E}_{x \sim D}[(f(x) - h(x))^2]$. To simplify
presentation, we will keep the parameter, $n$, representing the size of the instance space implicit,
rather than considering families of instance spaces and concept classes defined for all values of $n$.

³ The notion of smoothness in the work of Kalai et al. is stronger than ours. They only consider product
distributions in which each bit has mean bounded away from $\pm 1$ by a constant, and further require some
decoupling between the target function and the distribution.

For some bit vector $x$ (where bits may be $\{0,1\}$ or $\{-1,1\}$), and any subset $S \subseteq [n]$, $x_S$ denotes
the bits of $x$ corresponding to the variables $i \in S$. The set $-S$ denotes the set $[n] \setminus S$. For
two disjoint sets, $S$, $T$, $x_S x_T$ denotes the variables corresponding to the set $S \cup T$. In particular,
$x_S x_{-S} = x$.

If $D$ is a distribution over $X$, for a subset $S$, $D_S$ denotes the marginal distribution over variables
in the set $S$. Let $b_S$ denote a function, $b_S : S \to \{b_0, b_1\}$ (where $\{b_0, b_1\} = \{0,1\}$ or $\{b_0, b_1\} = \{-1,1\}$).
Then, $x_S = b_S$ denotes that for each $i \in S$, $x_i = b_S(i)$; thus the variables in the set $S$
are set to the values defined by the function $b_S$. Let $\pi : X \to \{0,1\}$ denote some property (e.g.
$\pi(x) = 1$ if $x_S = b_S$ and $\pi(x) = 0$ otherwise). The distribution $(D|\pi)$ denotes the conditional
distribution, given that $\pi(x) = 1$, i.e. the property holds.

PAC Learning [Val84]: Let $\mathcal{D}$ be a class of distributions over $X$ and $D \in \mathcal{D}$ be some distribution.
Let $C$ be a concept class over $X$, and $f \in C$. An example oracle, $\mathrm{EX}(f, D)$, when queried, returns
$(x, f(x))$, where $x$ is drawn randomly from distribution $D$. The learning algorithm in the PAC
model has access to an example oracle, $\mathrm{EX}(f, D)$, where $f \in C$ is the unknown target concept and
$D \in \mathcal{D}$ is the target distribution. The goal of the learning algorithm is to output a hypothesis, $h$,
that has low error with respect to the target concept under the target distribution, i.e.
$\mathrm{err}_D(h, f) = \Pr_{x \sim D}[h(x) \neq f(x)] \leq \epsilon$.

Membership Queries: Let $f \in C$ be a concept defined over instance space $X$. Then a membership
query is a point $x \in X$. A membership query oracle $\mathrm{MQ}(f)$, on receiving query $x \in X$, responds
with value $f(x)$. In the PAC+MQ model of learning, along with the example oracle $\mathrm{EX}(f, D)$, the
learning algorithm also has access to a membership oracle, $\mathrm{MQ}(f)$.

Local Membership Queries: For any point $x$, we say that a query $x'$ is $r$-local with respect
to $x$ if the Hamming distance, $|x - x'|_H$, is at most $r$. In our model, we only allow algorithms to
make queries that are $r$-local with respect to some example that they received by querying $\mathrm{EX}(f, D)$.
We think of examples coming through $\mathrm{EX}(f, D)$ as natural examples. Thus, the learning algorithm
draws a set of natural examples from $\mathrm{EX}(f, D)$ and then makes queries that are close to some
example from this set. Formally, we define learning using $r$-local membership queries as follows:

Definition 2.1 (PAC+r-local MQ Learning). Let $X$ be the instance space, $C$ a concept class over
$X$, and $\mathcal{D}$ a class of distributions over $X$. We say that $C$ is PAC-learnable using $r$-local membership
queries with respect to distribution class, $\mathcal{D}$, if there exists a learning algorithm, $\mathcal{L}$, such that for
every $\epsilon > 0$, $\delta > 0$, for every distribution $D \in \mathcal{D}$ and every target concept $f \in C$, the following hold:

1. $\mathcal{L}$ draws a sample, $\mathcal{S}$, of size $m = \mathrm{poly}(n, 1/\delta, 1/\epsilon)$ using example oracle, $\mathrm{EX}(f, D)$.

2. Each query, $x'$, made by $\mathcal{L}$ to the membership query oracle, $\mathrm{MQ}(f)$, is $r$-local with respect to
some example, $x \in \mathcal{S}$.

3. $\mathcal{L}$ outputs a hypothesis, $h$, that satisfies with probability at least $1 - \delta$, $\mathrm{err}_D(h, f) \leq \epsilon$.

4. The running time of $\mathcal{L}$ (hence also the number of oracle accesses) is polynomial in $n$, $1/\epsilon$,
$1/\delta$ and the output hypothesis, $h$, is polynomially evaluable.
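The locality constraint in the definition can be pictured as an oracle wrapper that refuses any query not within Hamming distance $r$ of a previously drawn natural example. The following is a sketch under our own assumptions: the class name `LocalMQ`, its methods, and the toy target are hypothetical and serve only to illustrate conditions 1 and 2.

```python
import random


def hamming(x, y):
    """Hamming distance between two equal-length tuples."""
    return sum(a != b for a, b in zip(x, y))


class LocalMQ:
    """Example oracle plus membership oracle restricted to r-local queries (hypothetical helper)."""

    def __init__(self, f, sample_point, r, rng):
        self.f = f
        self.sample_point = sample_point  # draws x ~ D
        self.r = r
        self.rng = rng
        self.sample = []                  # natural examples seen so far

    def example(self):
        """EX(f, D): draw a natural example and remember it."""
        x = self.sample_point(self.rng)
        self.sample.append(x)
        return x, self.f(x)

    def query(self, x_prime):
        """MQ(f), answered only within distance r of some natural example."""
        if not any(hamming(x, x_prime) <= self.r for x in self.sample):
            raise ValueError("query is not r-local with respect to the sample")
        return self.f(x_prime)


n = 6
f = lambda x: x[0] * x[1]  # toy target, an assumption for illustration
uniform = lambda rng: tuple(rng.choice((-1, 1)) for _ in range(n))
oracle = LocalMQ(f, uniform, r=1, rng=random.Random(1))

x, y = oracle.example()
flipped = (-x[0],) + x[1:]       # distance 1 from x: allowed
print(oracle.query(flipped) == f(flipped))
far = tuple(-b for b in x)       # distance n from x: rejected when r = 1
```

Calling `oracle.query(far)` raises `ValueError`, since the only natural example drawn so far is at distance $n$ from `far`.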
Smooth Distributions: Since we want to talk about smooth distributions over both $\{-1,1\}^n$ and
$\{0,1\}^n$, we consider $X = \{b_0, b_1\}^n$ and state the properties of interest in general terms. We
say that a distribution $D$ over $X = \{b_0, b_1\}^n$ is $\alpha$-smooth, for $\alpha \geq 1$, if for every pair $x, x' \in X$ with
Hamming distance $|x - x'|_H = 1$, it holds that $D(x)/D(x') \leq \alpha$. Thus, an $\alpha$-smooth distribution
has the property that flipping one bit of any point changes the probability mass by at most a factor
$\alpha$. We are particularly interested in the class of $\alpha$-smooth distributions for constant $\alpha$. For such
distributions, changing up to $O(\log(n))$ bits of a point changes the probability mass by at most a
polynomial (in $n$) multiplicative factor. Notice that the uniform distribution over the hypercube
and any product distribution with means bounded away from $b_0$ and $b_1$ (by a constant) are special
cases of smooth distributions.

We will repeatedly use the following useful properties of $\alpha$-smooth distributions. The proofs of
these are easy and hence are omitted.

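Fact 2.2. Let $D$ be an $\alpha$-smooth distribution over $X = \{b_0, b_1\}^n$. Then the following are true:

1. For $b \in \{b_0, b_1\}$, $\frac{1}{1+\alpha} \leq \Pr_{x \sim D}[x_i = b] \leq \frac{\alpha}{1+\alpha}$.

2. For any subset $S \subseteq [n]$, the marginal distribution $D_{-S}$ is $\alpha$-smooth.

3. For any subset $S \subseteq [n]$, and for any property $\pi_S$ that depends only on variables $x_S$ (e.g.
$x_S = b_S$), the marginal (with respect to $-S$) of the conditional distribution, $(D|\pi_S)_{-S}$, is
$\alpha$-smooth.

4. (As a corollary of the above three) $\left(\frac{1}{1+\alpha}\right)^{|S|} \leq \Pr_{x \sim D}[x_S = b_S] \leq \left(\frac{\alpha}{1+\alpha}\right)^{|S|}$.

5. (As a corollary of the above four) For any $x \in \{b_0, b_1\}^n$, $\left(\frac{1}{1+\alpha}\right)^{n} \leq D(x) \leq \left(\frac{\alpha}{1+\alpha}\right)^{n}$.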
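The bounds of the form $1/(1+\alpha) \leq \Pr[x_i = b] \leq \alpha/(1+\alpha)$, and their extension to a fixed assignment of a subset $S$ of variables, can be sanity-checked by enumeration. The sketch below does this for one small product distribution over $\{0,1\}^4$; the bit biases are arbitrary illustrative values and the helper names are hypothetical. It is a numerical check on one example, not a proof.

```python
import itertools

n = 4
bias = [0.3, 0.5, 0.6, 0.7]  # Pr[x_i = 1]; hypothetical values
cube = list(itertools.product((0, 1), repeat=n))


def mass(x):
    """Probability of x under the product distribution with the biases above."""
    p = 1.0
    for xi, bi in zip(x, bias):
        p *= bi if xi == 1 else 1 - bi
    return p


D = {x: mass(x) for x in cube}


def flip(x, i):
    return x[:i] + (1 - x[i],) + x[i + 1:]


# Smallest alpha for which D is alpha-smooth, by brute force.
alpha = max(D[x] / D[flip(x, i)] for x in cube for i in range(n))
lo, hi = 1 / (1 + alpha), alpha / (1 + alpha)
eps = 1e-12  # tolerance for floating-point rounding


def prob_fixed(S, b):
    """Pr_{x ~ D}[x_S = b]."""
    return sum(p for x, p in D.items() if all(x[i] == bi for i, bi in zip(S, b)))


# Single-variable bounds.
single_ok = all(
    lo - eps <= prob_fixed((i,), (b,)) <= hi + eps
    for i in range(n) for b in (0, 1)
)

# Bounds for every subset S and every assignment b_S.
subset_ok = all(
    lo ** k - eps <= prob_fixed(S, b) <= hi ** k + eps
    for k in range(1, n + 1)
    for S in itertools.combinations(range(n), k)
    for b in itertools.product((0, 1), repeat=k)
)
print(alpha, single_ok, subset_ok)
```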

Fourier Analysis: Here, we assume that $X = \{-1,1\}^n$ (and not $\{0,1\}^n$). For $S \subseteq [n]$, let
$\chi_S : X \to \{-1,1\}$ denote the parity function on bits in $S$, i.e. $\chi_S(x) = \prod_{i \in S} x_i$. When working with
the uniform distribution, $U_n$, over $\{-1,1\}^n$, it is well known that $(\chi_S)_{S \subseteq [n]}$ forms an orthonormal
basis (Fourier basis) for functions $f : X \to \mathbb{R}$. Hence $f$ can be represented as a degree $n$, multi-linear
polynomial over the variables $\{x_i\}$,

$$f(x) = \sum_{S \subseteq [n]} \hat{f}(S) \chi_S(x)$$

where $\hat{f}(S) = \mathbb{E}_{x \sim U_n}[\chi_S(x) f(x)]$. Define $L_1(f) = \sum_{S \subseteq [n]} |\hat{f}(S)|$, $L_2(f) = \sum_{S \subseteq [n]} \hat{f}(S)^2$,
$L_\infty(f) = \max_{S \subseteq [n]} |\hat{f}(S)|$ and $L_0(f) = |\{S \subseteq [n] \mid \hat{f}(S) \neq 0\}|$. Parseval's identity states that
$\mathbb{E}_{x \sim U_n}[f^2(x)] = L_2(f)$. For Boolean functions, i.e. with range $\{-1,1\}$, Parseval's identity implies that
$\sum_{S \subseteq [n]} \hat{f}(S)^2 = 1$. Other useful observations that we use frequently are: (i) $L_2(f) \leq L_1(f) \cdot L_\infty(f)$;
(ii) $L_2(f) \leq L_0(f) \cdot (L_\infty(f))^2$.
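For small $n$ the Fourier quantities above can be computed by direct enumeration. The sketch below does so for the majority function on three bits, which is our own choice of example and not a function studied in the paper; it then checks Parseval's identity and observations (i) and (ii) numerically.

```python
import itertools
from math import prod

n = 3
cube = list(itertools.product((-1, 1), repeat=n))
subsets = [S for k in range(n + 1) for S in itertools.combinations(range(n), k)]


def chi(S, x):
    """Parity chi_S(x) = product of x_i over i in S (empty product is 1)."""
    return prod(x[i] for i in S)


def f(x):
    """Illustrative target: majority of three +1/-1 bits."""
    return 1 if sum(x) > 0 else -1


# hat f(S) = E_{x ~ U_n}[chi_S(x) f(x)], computed exactly by enumeration.
coeff = {S: sum(chi(S, x) * f(x) for x in cube) / len(cube) for S in subsets}

L0 = sum(1 for c in coeff.values() if c != 0)
L1 = sum(abs(c) for c in coeff.values())
L2 = sum(c * c for c in coeff.values())
Linf = max(abs(c) for c in coeff.values())

# The Fourier expansion reproduces f on every point of the cube.
reconstructs = all(
    abs(sum(c * chi(S, x) for S, c in coeff.items()) - f(x)) < 1e-12 for x in cube
)
print(L0, L1, L2, Linf, reconstructs)
```

For this example the non-zero coefficients are $1/2$ on each singleton and $-1/2$ on $\{0,1,2\}$, so both inequalities (i) and (ii) hold with equality.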

Polynomials: In this paper, we are only concerned with multi-linear polynomials, since the domain
is $\{-1,1\}^n$ or $\{0,1\}^n$. A multi-linear polynomial over $n$ variables can be expressed as:

$$f(x) = \sum_{S \subseteq [n]} c_S \prod_{i \in S} x_i$$

When the domain is $\{-1,1\}^n$, the monomials correspond to the parity functions, $\chi_S(x)$, and if the
distribution is uniform over $\{-1,1\}^n$, $c_S = \mathbb{E}_{x \sim U}[f(x)\chi_S(x)] = \hat{f}(S)$. When the domain is $\{0,1\}^n$,
the monomials correspond to conjunctions over the variables included in the monomial. We denote
such conjunctions by $\xi_S(x) = \prod_{i \in S} x_i$.

For any $S \subseteq [n]$, define $f_S(x) := \sum_{T \supseteq S} c_T \prod_{i \in T \setminus S} x_i$, and $f_{-S}(x) := f(x) - \left(\prod_{i \in S} x_i\right) \cdot f_S(x)$.
Algorithm: LEARNING LOG-DEPTH DECISION TREES

Input: $d$ (depth of DT), $\alpha$, $\mathrm{EX}(f, D)$, (local)-$\mathrm{MQ}(f)$

1. Let $\mathcal{S} = \{\emptyset\}$; $\theta = (1+\alpha)^{-d-1}$

2. For $i = 1, \ldots, d$

(a) For every $S' \in \mathcal{S}$ with $|S'| = i - 1$ and for every $j \in [n] \setminus S'$

i. Let $S = S' \cup \{j\}$

ii. (Non-Zero Test) If $\Pr_{x \sim D}[f_S(x) \neq 0] \geq \theta$, then $\mathcal{S} = \mathcal{S} \cup \{S\}$

3. Let polynomial $h(x)$ over terms in $\mathcal{S}$ be obtained by minimizing $\mathbb{E}_{x \sim D}[(f(x) - h(x))^2]$,
constrained by $|\hat{h}(S)| \leq t$.

Output: $\mathrm{sign}(h(x))$.

Figure 1: Algorithm: Learning log-depth decision trees

3 Learning Decision Trees 

In this section, we present two algorithms for learning decision trees. Section 3.1 shows that
$O(\log(n))$-depth decision trees can be efficiently learned under $\alpha$-smooth distributions, for constant
$\alpha$. This result is actually a special case of the result in Section 4, but we present it separately because
it is somewhat simpler. In Section 3.2, we show that the class of polynomial-size decision trees can
be learned under the uniform distribution. This algorithm is extended to product distributions;
this fact is shown in Section 3.3. Section 3.4 shows that these algorithms can be made to work
under random classification noise.

3.1 Learning log-Depth Trees under Smooth Distributions 

In this section, the domain is assumed to be $\{-1,1\}^n$. Let $f$ be a function that can be represented
as a depth $d$ decision tree with $t$ leaves (note that $t \leq 2^d$). We are mostly interested in the case
when $d = O(\log(n))$,⁴ since our algorithms run in time polynomial in $2^d$. By slightly abusing
notation, let $f$ also denote the polynomial representing the decision tree,

$$f(x) = \sum_{S \subseteq [n]} \hat{f}(S) \chi_S(x)$$

Also, we stick to the notation $\hat{f}(S)$ as the coefficient of $\chi_S(x)$ even though we will consider
distributions that are not uniform over $\{-1,1\}^n$. Thus, the coefficients cannot be interpreted
as a Fourier transform. The polynomial $f$ has degree $d$. Also, it is the case that $L_0(f) \leq t 2^d$,
$L_1(f) \leq t$, and $L_2(f) = 1$. (These facts hold even when the distribution is not uniform and are
standard. The reader is referred to [Man94]. However, when the distribution is not uniform, it
is not the case that $\mathbb{E}_{x \sim D}[f(x)^2] = L_2(f)$.) More importantly, it is also the case that if $\hat{f}(S) \neq 0$,
then $|\hat{f}(S)| \geq 1/2^d$. We will use this fact effectively in showing that depth-$d$ decision trees can be
efficiently learned under $\alpha$-smooth distributions, when $d = O(\log(n))$ and $\alpha$ is a constant.

⁴ When $d = O(\log(n))$, whether we consider the domain to be $\{-1,1\}^n$ or $\{0,1\}^n$ is unimportant, since the sparsity
of the polynomial is preserved (up to polynomial factors) when going from one domain to the other.

The algorithm (see Fig. 1) finds all the monomials that are relevant. Call a subset $S \subseteq [n]$
maximal for $f$ if $\hat{f}(S) \neq 0$ and for all $T \supsetneq S$, $\hat{f}(T) = 0$. We prove the following two crucial points:

1. If $S$ is maximal for $f$, all subsets of $S$ pass the non-zero test (Step 2.(a).ii. of the algorithm).

2. If a set $T$ is not a subset of some $S$ that is maximal for $f$, $T$ fails the non-zero test (Step
2.(a).ii).

The above points can be used to prove two facts. First, that the relevant monomials can be found
by building up from the empty set (since all subsets of maximal monomials pass the test). And
second, that the total number of subsets that are added into the set of important monomials (in
Step 2) is at most $t 2^d$. This bounds the running time of the algorithm.
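The level-wise search can be sketched on a toy instance. The code below is an idealized illustration under several assumptions of ours: the target is given directly by its coefficients, the distribution is uniform ($\alpha = 1$), the probability in the non-zero test is computed exactly by enumeration rather than estimated from local queries, and the threshold is taken as $(1+\alpha)^{-d-1}$, which is how we read Figure 1. The function names and the example tree are hypothetical.

```python
import itertools
from math import prod


def f_S(coeff, S, x):
    """f_S(x): sum over T containing S of coeff[T] times the product of x_i over T minus S."""
    return sum(c * prod(x[i] for i in T - S) for T, c in coeff.items() if S <= T)


def find_monomials(coeff, n, d, alpha=1.0):
    """Grow candidate sets one variable at a time, keeping those that pass the non-zero test."""
    cube = list(itertools.product((-1, 1), repeat=n))
    theta = (1 + alpha) ** (-d - 1)
    found = {frozenset()}
    for size in range(1, d + 1):
        # Snapshot the parents so that `found` can be extended inside the loop.
        for parent in [S for S in found if len(S) == size - 1]:
            for j in set(range(n)) - parent:
                S = parent | {j}
                # Exact Pr_{x ~ uniform}[f_S(x) != 0], up to a rounding tolerance.
                p = sum(abs(f_S(coeff, S, x)) > 1e-12 for x in cube) / len(cube)
                if p >= theta:
                    found.add(S)
    return found


# Depth-2 tree over {-1,1}^3: read x0; output x1 if x0 = 1, else x2.
# As a polynomial: x1/2 + x0*x1/2 + x2/2 - x0*x2/2.
coeff = {
    frozenset({1}): 0.5,
    frozenset({0, 1}): 0.5,
    frozenset({2}): 0.5,
    frozenset({0, 2}): -0.5,
}
found = find_monomials(coeff, n=3, d=2)
print(sorted(sorted(S) for S in found))
```

On this example the search keeps exactly the subsets of the two maximal sets $\{0,1\}$ and $\{0,2\}$, and never adds $\{1,2\}$, in line with the two points above.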

Non-Zero Test: In step 2.(a).ii. of the algorithm (Fig. 1), we check the following, which we call
the non-zero test: $\Pr_{x \sim D}[f_S(x) \neq 0] \geq \theta$. Recall that

$$f_S(x) = \sum_{T \supseteq S} \hat{f}(T) \chi_{T \setminus S}(x)$$

Note that in fact, $f_S$ is a function of $x_{-S}$. Let $x_{-S}$ be fixed. Let $U_S$ denote the uniform
distribution over $x_S$. Observe that $f(x) = f_{-S}(x) + \chi_S(x_S) f_S(x_{-S})$. Now each monomial in $f_{-S}$
is missing at least one variable from $x_S$ (otherwise it would have been in $f_S$), so its product with
$\chi_S(x_S)$ averages to zero over $U_S$. Thus, $f_S(x_{-S}) = \mathbb{E}_{x_S \sim U_S}[\chi_S(x_S) f(x_S x_{-S})]$. Thus, if $x$ were
a point drawn from the example oracle, $\mathrm{EX}(f, D)$, $f_S(x_{-S})$
can be computed using $|S|$-local membership queries. Now, $\Pr_{x \sim D}[f_S(x) \neq 0]$ can be estimated
very accurately by sampling. We ignore the analysis that employs standard Chernoff bounds and
assume that we can perform the test in Step 2.(a).ii. with perfect accuracy. We prove the following:

Theorem 3.1. The class of depth-$O(\log(n))$ decision trees is learnable using $O(\log(n/\epsilon))$-local
membership queries under the class of $\alpha$-smooth distributions, for constant $\alpha$, in time that is
polynomial in $n$, $1/\epsilon$, $1/\delta$.
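The computation of $f_S(x_{-S})$ behind the non-zero test can be sketched as follows. Using the decomposition $f = f_{-S} + \chi_S \cdot f_S$, averaging $\chi_S(x_S) f(x_S x_{-S})$ over all $2^{|S|}$ settings of the bits in $S$ isolates $f_S(x_{-S})$, and every queried point differs from the natural example $x$ only on $S$. The function names and the toy tree below are our own illustrative assumptions.

```python
import itertools
from math import prod


def tree(x):
    """Hypothetical depth-2 decision tree over {-1,1}^3: read x0, then output x1 or x2."""
    return x[1] if x[0] == 1 else x[2]


def f_S_by_local_queries(mq, x, S):
    """Average chi_S(x_S) * f(x_S x_{-S}) over all settings of the bits in S.

    Every queried point agrees with x outside S, so each query is |S|-local
    with respect to x.
    """
    S = sorted(S)
    total = 0.0
    for bits in itertools.product((-1, 1), repeat=len(S)):
        q = list(x)
        for i, b in zip(S, bits):
            q[i] = b
        total += prod(bits) * mq(tuple(q))
    return total / 2 ** len(S)


x = (1, 1, -1)  # stands in for a natural example
print(f_S_by_local_queries(tree, x, {0}))     # f_{0}(x) = x1/2 - x2/2
print(f_S_by_local_queries(tree, x, {0, 1}))  # constant coefficient of x0*x1
print(f_S_by_local_queries(tree, x, {1, 2}))  # no monomial contains both x1 and x2
```

The number of queries is $2^{|S|}$, which stays polynomial in $n$ when $|S| = O(\log(n))$.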

The main tools required to prove Theorem 3.1 are Lemmas 3.2 and 3.3. Lemma 3.2 shows that
if some subset $S$ satisfies $\Pr[f_S(x) \neq 0] \geq \theta$, then there exists some $T \supseteq S$ such that $\hat{f}(T) \neq 0$.
Lemma 3.3 can be used to show that if $S \subseteq T$, where $\hat{f}(T) \neq 0$ and $T$ is maximal for $f$, then
$\Pr_{x \sim D}[f_S(x) \neq 0] \geq \theta$. Note that what Lemma 3.3 actually proves is that if $\Pr_{x \sim D}[f_S(x) \neq 0] < \theta$,
then $\Pr_{x \sim D}[f_{S \cup \{i\}}(x) \neq 0] < \theta(1+\alpha)$, for any $i \notin S$. Now, if $T$ is maximal, and $S \subseteq T$, this would
imply that $\Pr_{x \sim D}[f_T(x) \neq 0] < \theta(1+\alpha)^{|T| - |S|}$. But, if $T$ is maximal, the polynomial $f_T(x)$ is a
non-zero constant polynomial. Thus, if $\theta$ is chosen to be smaller than $1/(1+\alpha)^d$, where $d = O(\log(n))$
is a bound on the depth of the decision tree being learned, the algorithm finds all the important
monomials. Once all the important monomials have been found, the best polynomial can easily be
obtained, for example by regression. Alternatively, one could also solve a system of linear equations
on the monomials and this actually learns the decision tree exactly.⁵

Lemma 3.2. For any set $S \subseteq [n]$, if for all $T \supseteq S$, $\hat{f}(T) = 0$, then $\Pr_{x \sim D}[f_S(x) \neq 0] = 0$.

Proof. The polynomial $f_S(x) \equiv 0$. □

Lemma 3.3. Let $D$ be any $\alpha$-smooth distribution, then for any set $S$, and any $i \notin S$,
$\Pr_{x \sim D}[f_S(x) \neq 0] \geq \frac{1}{1+\alpha} \Pr_{x \sim D}[f_{S \cup \{i\}}(x) \neq 0]$.

⁵ By this we mean that the output hypothesis is such that $\Pr_{x \sim D}[h(x) \neq f(x)] = 0$, except with some small
probability.

Proof. Recall that $f_S(x) = \sum_{T \supseteq S} \hat{f}(T) \chi_{T \setminus S}(x)$. Also, note that $f_S$ is only a function of the
variables $x_{-S}$. Observe that,

$$f_S(x_{-S}) = \sum_{T \supseteq S,\ i \notin T} \hat{f}(T) \chi_{T \setminus S}(x_{-S}) + x_i f_{S \cup \{i\}}(x_{-(S \cup \{i\})})$$

Note that $D_{-S}$, the marginal distribution, is also $\alpha$-smooth (see Fact 2.2). Let
$(D_{-S} \mid f_{S \cup \{i\}}(x_{-(S \cup \{i\})}) \neq 0)$ be the conditional distribution, given $f_{S \cup \{i\}}(x_{-(S \cup \{i\})}) \neq 0$.
Then, the marginal distribution, $(D_{-S} \mid f_{S \cup \{i\}}(x_{-(S \cup \{i\})}) \neq 0)_i$, is a distribution on only one
variable, $x_i$, but is also $\alpha$-smooth (Fact 2.2).
But, given that $f_{S \cup \{i\}}(x_{-(S \cup \{i\})}) \neq 0$, for one of the two values, $x_i = 1$ or $x_i = -1$, it must be the
case that $f_S(x_{-S}) \neq 0$. Since the distribution $(D_{-S} \mid f_{S \cup \{i\}}(x_{-(S \cup \{i\})}) \neq 0)_i$ is $\alpha$-smooth, the proof
of the Lemma follows from Fact 2.2. (Note that $\Pr_{x \sim D}[f_S(x) \neq 0] = \Pr_{x_{-S} \sim D_{-S}}[f_S(x_{-S}) \neq 0]$,
since $f_S$ does not depend on the variables $x_S$.) □

3.2 Learning Decision Trees under the Uniform Distribution 

In this section, we present an algorithm for learning t-leaf decision trees (of arbitrary depth) under
the uniform distribution. Although the uniform distribution is a special case of the product
distributions considered in Section 3.3, the exposition is simpler and conveys the high-level ideas better.

We use standard results from Fourier analysis; Kushilevitz and Mansour [KM91] proved the
following useful properties of the Fourier spectrum of decision trees. Let $f$ be a function that is
represented by a t-leaf decision tree, then:

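1. For any set $S \subseteq [n]$, $|\hat{f}(S)| \leq t/2^{|S|}$.

2. $L_1(f) = \sum_{S \subseteq [n]} |\hat{f}(S)| \leq t$.

Using the above relations, we can immediately prove the following useful (and well-known) fact.

Fact 3.4. Suppose $f$ is a boolean function that is represented by a t-leaf decision tree. Then, for any
$\tau > 0$,

$$\sum_{S : |S| \geq \log(t^2/\tau)} \hat{f}(S)^2 \leq \tau.$$

Proof. Consider,

$$\sum_{S : |S| \geq \log(t^2/\tau)} \hat{f}(S)^2 \;\leq\; \max_{S : |S| \geq \log(t^2/\tau)} |\hat{f}(S)| \cdot \Big( \sum_{T : |T| \geq \log(t^2/\tau)} |\hat{f}(T)| \Big) \;\leq\; t \cdot \frac{\tau}{t^2} \cdot L_1(f) \;\leq\; \tau.$$

□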

The algorithm in Figure 2 learns t-leaf decision trees under the uniform distribution. For
simplicity of presentation, we assume that the expectations used in the algorithm and also the
Fourier coefficients can be computed exactly. Standard applications of Chernoff-Hoeffding bounds
show that the guarantees of the algorithm hold even when the expectations
and values of the Fourier coefficients can only be computed approximately. The main step in
Algorithm 2 that requires some explanation is how to compute the quantity $\mathbb{E}_{x \sim U}[f_S(x)^2]$ to check
if it is greater than $\theta^2$. We refer to this as the $L_2$ Test.

$L_2$ Test: Let $x \in \{-1, 1\}^n$, and recall that for $S \subseteq [n]$, $f_S(x) = \sum_{T \supseteq S} \hat{f}(T)\chi_{T \setminus S}(x)$, and that this
can be computed by using the fact that

$$f_S(x_{-S}) = \mathbb{E}_{x_S \sim U_S}[\chi_S(x) f(x)].$$
Algorithm: Learning Decision Trees
inputs: $d$, $\theta$, oracles $\mathrm{EX}(f, U)$, (local)-$\mathrm{MQ}(f)$

1. let $\mathcal{S} = \{\emptyset\}$

2. for $i = 1, \ldots, d$

   (a) for every $S' \in \mathcal{S}$ with $|S'| = i - 1$ and for every $j \in [n] \setminus S'$

       i. let $S = S' \cup \{j\}$

       ii. ($L_2$ Test) if $\mathbb{E}_{x \sim U}[f_S(x)^2] > \theta^2$, then $\mathcal{S} = \mathcal{S} \cup \{S\}$

3. let $h(x) = \sum_{S \in \mathcal{S}} \hat{f}(S)\chi_S(x)$
output: $\mathrm{sign}(h(x))$

Figure 2: Algorithm: Learning Decision Trees under the Uniform Distribution
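The level-wise search in Figure 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the expectations and Fourier coefficients are computed exactly by brute force over the whole cube (so it only runs for tiny $n$), and the names `learn_heavy_sets`, `hypothesis` and `toy_tree` are hypothetical.

```python
from itertools import product

def chi(S, x):
    """Parity character chi_S(x) = prod_{i in S} x_i for x in {-1,+1}^n."""
    p = 1
    for i in S:
        p *= x[i]
    return p

def f_S(f, S, x):
    """f_S(x) = E_{x_S ~ U_S}[chi_S(x) f(x)]: average over all settings of the bits in S."""
    S = sorted(S)
    total = 0.0
    for bits in product((-1, 1), repeat=len(S)):
        y = list(x)
        for i, b in zip(S, bits):
            y[i] = b
        total += chi(S, y) * f(tuple(y))
    return total / 2 ** len(S)

def learn_heavy_sets(f, n, d, theta):
    """Level-wise search of Figure 2, with exact expectations (assumption: tiny n)."""
    cube = list(product((-1, 1), repeat=n))
    family = {frozenset()}
    for i in range(1, d + 1):
        # snapshot of the sets of size i-1 found so far
        for S_prev in [S for S in family if len(S) == i - 1]:
            for j in range(n):
                if j in S_prev:
                    continue
                S = S_prev | {j}
                l2 = sum(f_S(f, S, x) ** 2 for x in cube) / len(cube)
                if l2 > theta ** 2:          # the L2 test
                    family.add(S)
    coeffs = {S: sum(f(x) * chi(S, x) for x in cube) / len(cube) for S in family}
    return family, coeffs

def hypothesis(coeffs, x):
    """sign(h(x)) for h(x) = sum_S coeffs[S] * chi_S(x)."""
    h = sum(c * chi(S, x) for S, c in coeffs.items())
    return 1 if h >= 0 else -1

def toy_tree(x):
    """A 3-leaf decision tree on 4 variables: query x_0, then output x_1 or x_2."""
    return x[1] if x[0] == 1 else x[2]
```

For `toy_tree` the non-zero coefficients sit on $\{1\}, \{2\}, \{0,1\}, \{0,2\}$, so with $d = 2$ and $\theta = 0.4$ the search keeps these sets and their subsets and discards everything involving the irrelevant variable 3.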

Given a point $(x, f(x))$, we observe that the expectation $\mathbb{E}_{x_S \sim U_S}[f(x)\chi_S(x)]$ can be computed using
$2^{|S|}$ membership queries that are $|S|$-local with respect to $x$ (only the bits in $S$ need to be flipped). The
quantity $\mathbb{E}_{x \sim U}[f_S(x)^2]$ can thus be computed easily using only $|S|$-local membership queries and
taking a sample from $\mathrm{EX}(f, U)$.
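To make the locality of these queries concrete, the following sketch estimates $\mathbb{E}_{x \sim U}[f_S(x)^2]$ from random examples. It is only an illustration under stated assumptions: `LocalMQ` is a hypothetical oracle wrapper that rejects any query farther than $r$ bit flips from the example it is anchored to, and uniform examples are simulated with Python's `random` module rather than drawn from a real example oracle.

```python
import random
from itertools import product

class LocalMQ:
    """Hypothetical oracle: answers f(y) only if y is within Hamming distance r of the anchor."""
    def __init__(self, f, r):
        self.f, self.r, self.calls = f, r, 0

    def query(self, anchor, y):
        dist = sum(a != b for a, b in zip(anchor, y))
        if dist > self.r:
            raise ValueError("query is not local")
        self.calls += 1
        return self.f(y)

def f_S_local(mq, S, x):
    """f_S(x) via 2^{|S|} queries that differ from x only on the bits in S."""
    S = sorted(S)
    total = 0
    for bits in product((-1, 1), repeat=len(S)):
        y = list(x)
        sign = 1                      # chi_S(y) = product of the bits placed on S
        for i, b in zip(S, bits):
            y[i] = b
            sign *= b
        total += sign * mq.query(x, tuple(y))
    return total / 2 ** len(S)

def estimate_l2(mq, S, n, m, rng):
    """Empirical mean of f_S(x)^2 over m simulated uniform examples."""
    acc = 0.0
    for _ in range(m):
        x = tuple(rng.choice((-1, 1)) for _ in range(n))
        acc += f_S_local(mq, S, x) ** 2
    return acc / m
```

Each call to `f_S_local` issues exactly $2^{|S|}$ queries, so a set of size $O(\log n)$ costs polynomially many queries per example while never leaving the Hamming ball of radius $|S|$ around that example.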

High-Level Overview of Proof: Fact 3.4 showed that the Fourier mass (sum of squares of the
Fourier coefficients) of t-leaf decision trees is concentrated on low degree terms. Parseval's identity
implies that this is sufficient to construct a polynomial, $h(x)$, that is a good $L_2$ approximation to
the decision tree, $f$, i.e. $\mathbb{E}_{x \sim U}[(h(x) - f(x))^2] \leq \epsilon$. Also, Kushilevitz and Mansour [KM91] showed
that since $L_1(f)$ is bounded, most of the Fourier mass is concentrated on a small (polynomially
many) number of terms.

The main insight here is that the terms on which most of the Fourier mass is concentrated
can be identified using only $O(\log(n))$-local membership queries. It is relatively easy to see that
any coefficient for which $|\hat{f}(S)| > \theta$ will be identified correctly by the test in line 2.(a).ii. (Figure 2).
We show that the quantity $|\mathcal{S}|$ never grows too large. To show this, we prove that if any set
is inserted in $\mathcal{S}$ in line 2.(a).ii, it must be a subset of some coefficient of large magnitude. This
follows quite easily by observing that $\mathbb{E}_{x \sim U}[f_S(x)^2] = \sum_{T \supseteq S} \hat{f}(T)^2$ and using the fact that $L_1(f)$
is bounded.

The rest of the section is devoted to a formal proof of the above overview. 
Claim 3.5. Suppose that $S$ is such that $|\hat{f}(S)| > \theta$ and $|S| \leq d$, then $S \in \mathcal{S}$.

Proof. First observe that for any subset $S' \subseteq S$, it holds that $\mathbb{E}[f_{S'}(x)^2] > \theta^2$. This follows
immediately by observing that

$$\mathbb{E}[f_{S'}(x)^2] = \sum_{T \supseteq S'} \hat{f}(T)^2 \geq \hat{f}(S)^2 > \theta^2.$$

It follows by a simple induction argument that at iteration $i$, $\mathcal{S}$ contains every subset $S'$ of $S$ of size
at most $i$, for which $\mathbb{E}[f_{S'}(x)^2] > \theta^2$. And hence $S \in \mathcal{S}$. □

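Claim 3.6. If $S \in \mathcal{S}$, then there exists an $S' \supseteq S$ such that $|\hat{f}(S')| > \theta^2/t$.

Proof. Since $S \in \mathcal{S}$, we know that $\mathbb{E}[f_S(x)^2] = \sum_{S' \supseteq S} \hat{f}(S')^2 > \theta^2$. But observe that,

$$\sum_{S' \supseteq S} \hat{f}(S')^2 \leq \Big( \sum_{S' \supseteq S} |\hat{f}(S')| \Big) \cdot \max_{S' \supseteq S} |\hat{f}(S')|.$$

The above inequality simply states the fact that the sum of squares of the Fourier coefficients of $f_S$ is
at most $L_1(f_S) L_\infty(f_S)$. Since $f$ is a t-leaf decision
tree, $\sum_{S' \supseteq S} |\hat{f}(S')| \leq L_1(f) \leq t$. The claim now follows immediately. □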

Using the above claims, it is easy to show our main theorem. 

Theorem 3.7. The algorithm in Fig. 2, run with parameters $d = \log(2t^2/\epsilon)$ and $\theta = \epsilon/(2t)$, outputs
a hypothesis, $\mathrm{sign}(h(x))$, where $\mathrm{err}_U(\mathrm{sign}(h(x)), f) \leq \epsilon$. The running time is $\mathrm{poly}(t, n, 1/\epsilon)$ and the
algorithm only makes $\log(2t^2/\epsilon)$-local queries to the membership oracle $\mathrm{MQ}(f)$.

Proof. First, we recall that for a t-leaf decision tree, $|\hat{f}(S)| \leq t/2^{|S|}$ (see [KM91]). Thus, if $|\hat{f}(S)| >
\theta^2/t$, then $|S| \leq 2\log(t/\theta)$. Using Parseval's identity (see Section 2), we know that the number of
Fourier coefficients that have magnitude greater than $\theta^2/t$ is at most $t^2/\theta^4$.

Consider the set $\mathcal{S}$ constructed by the algorithm (Fig. 2) at the end of $d$ iterations. If $S \in \mathcal{S}$,
then there must exist some $T \supseteq S$ such that $|\hat{f}(T)| > \theta^2/t$ (Claim 3.6). But there can be at most
$t^2/\theta^4$ such terms and each is of size at most $2\log(t/\theta)$. Hence, $|\mathcal{S}| \leq (t^2/\theta^4) \cdot 2^{2\log(t/\theta)} = t^4/\theta^6$.

For any coefficient such that $|\hat{f}(S)| > \theta$, it must be that $|S| \leq \log(t/\theta) \leq d$. Claim 3.5 shows
that all such coefficients are included in $\mathcal{S}$. Thus, $\max_{S \notin \mathcal{S}} |\hat{f}(S)| \leq \theta$. Hence,
$\sum_{S \notin \mathcal{S}} \hat{f}(S)^2 \leq \big( \sum_{S \notin \mathcal{S}} |\hat{f}(S)| \big) \cdot \max_{S \notin \mathcal{S}} |\hat{f}(S)| \leq L_1(f) \cdot \theta \leq \theta t$.
But $\mathbb{E}[(h(x) - f(x))^2] = \sum_{S \notin \mathcal{S}} \hat{f}(S)^2$, and also notice
that $\Pr_{x \sim U}[\mathrm{sign}(h(x)) \neq f(x)] \leq \mathbb{E}_{x \sim U}[(h(x) - f(x))^2]$ (since $f(x)$ only takes values $\pm 1$). □



3.3 Learning Decision Trees under Product Distributions 

In this section, we prove that the class of t-leaf decision trees can be learned under the class of
product distributions, where each bit has mean bounded away from $-1$ and $1$. Let $\mu = (\mu_1, \ldots, \mu_n)$
denote a product distribution over $X = \{-1, 1\}^n$, where $\mathbb{E}_{x \sim \mu}[x_i] = \mu_i \in [-1 + 2c, 1 - 2c]$, for some
constant $c \in (0, 1/2]$. We use Fourier analysis with the basis modified for the product distribution.
We begin by introducing the notation required for using Fourier techniques.

Fourier Analysis over $\mu$: Let $\mu = (\mu_1, \ldots, \mu_n)$ be the product distribution over $X = \{-1, 1\}^n$,
where $\mathbb{E}_{x \sim \mu}[x_i] = \mu_i$. Define,

$$\chi^\mu_S(x) = \prod_{i \in S} \frac{x_i - \mu_i}{\sqrt{1 - \mu_i^2}}.$$

Then, it is easy to observe that for any two sets $S_1 \neq S_2$, $\mathbb{E}_{x \sim \mu}[\chi^\mu_{S_1}(x)\chi^\mu_{S_2}(x)] = 0$ and that, for any
set $S$, $\mathbb{E}_{x \sim \mu}[\chi^\mu_S(x)^2] = 1$. Thus, the set of functions $(\chi^\mu_S(x))_{S \subseteq [n]}$ forms an orthonormal basis for
functions defined on $\{-1, 1\}^n$ under the distribution $\mu$. For any function $f : \{-1, 1\}^n \to \mathbb{R}$, the
Fourier coefficients under distribution $\mu$ are defined as $\hat{f}^\mu(S) = \mathbb{E}_{x \sim \mu}[f(x)\chi^\mu_S(x)]$. The following is
Parseval's identity in this basis:

$$\mathbb{E}_{x \sim \mu}[f(x)^2] = \sum_{S \subseteq [n]} \hat{f}^\mu(S)^2 \qquad (1)$$

In particular, when $f$ is a boolean function, i.e. with range $\{-1, 1\}$, the sum of squares of the Fourier
coefficients is 1.
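As a sanity check on the modified basis, the short sketch below verifies orthonormality and Parseval's identity numerically for a small product distribution. It is a minimal illustration with exact expectations computed by enumerating the cube; the function names are hypothetical, and the bias-adjusted character is taken to be $\prod_{i \in S} (x_i - \mu_i)/\sqrt{1 - \mu_i^2}$, the standard form for product distributions.

```python
from itertools import product
from math import prod, sqrt

def point_prob(mu, x):
    """Probability of x in {-1,+1}^n when bit i has mean mu[i], independently."""
    return prod((1 + m * xi) / 2 for m, xi in zip(mu, x))

def chi_mu(mu, S, x):
    """Bias-adjusted character: product over i in S of (x_i - mu_i) / sqrt(1 - mu_i^2)."""
    return prod((x[i] - mu[i]) / sqrt(1 - mu[i] ** 2) for i in S)

def expectation(mu, g):
    """Exact expectation of g under the product distribution (enumerates the cube)."""
    n = len(mu)
    return sum(point_prob(mu, x) * g(x) for x in product((-1, 1), repeat=n))

def fourier_coefficients(mu, f):
    """All coefficients E[f(x) chi_S(x)], indexed by S given as a sorted tuple."""
    n = len(mu)
    subsets = [tuple(i for i in range(n) if (mask >> i) & 1) for mask in range(2 ** n)]
    return {S: expectation(mu, lambda x, S=S: f(x) * chi_mu(mu, S, x)) for S in subsets}
```

For any $\pm 1$-valued `f`, the squares of the values returned by `fourier_coefficients` should sum to 1 up to floating-point error, which is the content of identity (1).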
Let $L_1^\mu(f) = \sum_{S \subseteq [n]} |\hat{f}^\mu(S)|$, $L_2^\mu(f) = \sum_{S \subseteq [n]} \hat{f}^\mu(S)^2$ and $L_\infty^\mu(f) = \max_{S \subseteq [n]} |\hat{f}^\mu(S)|$ denote the
1, 2 and $\infty$ norm of the Fourier spectrum under distribution $\mu$. Also let $L_0^\mu(f) = |\{S \mid \hat{f}^\mu(S) \neq 0\}|$
denote the number of non-zero Fourier coefficients of $f$. We will frequently use the following useful
observations:

1. $L_2^\mu(f) \leq L_1^\mu(f) \cdot L_\infty^\mu(f)$

2. $L_2^\mu(f) \leq L_0^\mu(f) \cdot (L_\infty^\mu(f))^2$

3.3.1 Decision Tree Learning Algorithm 



Algorithm: Learning Decision Trees

inputs: $d$, $\theta$, oracles $\mathrm{EX}(f, \mu)$, $\mathrm{MQ}(f)$

# $f$ is a t-leaf decision tree

# $\mu$ is a product distribution over $\{-1, 1\}^n$, $\mu_i \in [-1 + 2c, 1 - 2c]$

1. let $\mathcal{S} = \{\emptyset\}$

2. for $i = 1, \ldots, d$

   (a) for every $S' \in \mathcal{S}$ with $|S'| = i - 1$ and for every $j \in [n] \setminus S'$

       i. let $S = S' \cup \{j\}$

       ii. ($L_2$ Test) if $\mathbb{E}_{x \sim \mu}[f_S(x)^2] > \theta^2$, then $\mathcal{S} = \mathcal{S} \cup \{S\}$

3. let $h(x) = \sum_{S \in \mathcal{S}} \hat{f}^\mu(S)\chi^\mu_S(x)$
output: $\mathrm{sign}(h(x))$

Figure 3: Algorithm: Learning Decision Trees under Product Distributions

We present a high-level overview of our algorithm and a formal statement of the main result, 
before providing full details. The Algorithm is described in Figure 3. 

Truncation: We show that a t-leaf decision tree, when truncated to logarithmic depth, is still a
very good (inverse polynomially close) approximation to the original decision tree. This observation
can be used to show that it suffices to identify low-degree (logarithmic) "heavy" Fourier coefficients
of $f$, with respect to the distribution $\mu$, and also that the number of such terms is not too large (at
most polynomial). Note that this is not as simple as in the case of the uniform distribution, because
it is not straightforward to bound $L_1^\mu(f) = \sum_{S \subseteq [n]} |\hat{f}^\mu(S)|$. (When $\mu$ is the uniform distribution,
this is bounded by $t$.) Properties of such truncated decision trees were also used by Kalai et
al. [KST09] in the smoothed analysis setting.

A t-leaf decision tree can be thought of as $t$ (not disjoint) paths from root to leaves. A truncation
of a decision tree at depth $d$ is a decision tree where, for each path of length more than $d$, only the
prefix (from the root) of length $d$ is preserved. Note that this may collapse several paths to the same
prefix, possibly reducing the number of leaves. A new leaf is added at the end of this path and
labeled arbitrarily as $-1$ or $+1$.

For any function $g$, we denote by $\mathcal{S}_g$ the set of non-zero Fourier coefficients of $g$, with respect
to the product distribution $\mu$, i.e. $\mathcal{S}_g = \{T \subseteq [n] \mid \hat{g}^\mu(T) \neq 0\}$.
We prove two useful properties of the truncated decision trees with respect to product
distributions. These appear as formal statements in Lemmas 3.8 and 3.9. Similar observations were also
used by [KST09] to prove learning of decision trees in the smoothed analysis setting.

(i) Truncation at logarithmic depth is a good approximation (inverse polynomial) to the original
decision tree.

(ii) The number of non-zero Fourier coefficients of the truncated decision tree, $|\mathcal{S}_g|$, is small
(polynomial).

Lemma 3.8. Let $f$ be a t-leaf decision tree, let $\mu$ be a product distribution over $X = \{-1, 1\}^n$ such
that $\mu_i \in [-1 + 2c, 1 - 2c]$. Then for every $\tau > 0$, there exists a t-leaf decision tree $g$ of depth at most
$\log(t/\tau)/\log(1/(1 - c))$, such that $\Pr_{x \sim \mu}[g(x) \neq f(x)] \leq \tau$.

Proof. Let $g$ be the decision tree obtained by truncating $f$ at depth $d$. The new leaves added at
depth $d$ can be labeled arbitrarily. Now, the points $x$ for which $g(x) \neq f(x)$ are among those for
which $g$ would lead to a newly added leaf node at depth $d$. But since $\mathbb{E}_{x \sim \mu}[x_i] \in [-1 + 2c, 1 - 2c]$,
each branch is taken with probability at most $1 - c$, so the probability that a random point from $\mu$
reaches such a node is at most $(1 - c)^d$. The number
of new leaf nodes added cannot be more than $t$ (since any truncation only reduces the number of
leaves). Thus, $\Pr_{x \sim \mu}[g(x) \neq f(x)] \leq t(1 - c)^d$. When $d = \log(t/\tau)/\log(1/(1 - c))$ we get the
result. □

Lemma 3.9. Let $g$ be a decision tree of depth $d$ and $t$ leaves; then the number of non-zero Fourier
coefficients of $g$ is at most $t \cdot 2^d$ and each is of size at most $d$.

Proof. We consider any path in $g$ from root to leaf, and let $P$ denote the subset of indices
corresponding to the variables that occur in the path. First, we expand the decision tree $g$ as a polynomial,

$$g(x) = \sum_{\text{path } P} y_P \prod_{i \in P} \frac{1 + a_{P,i}\, x_i}{2},$$

where $a_{P,i}$ is $+1$ or $-1$, depending on whether the edge leading out of the node labeled $x_i$ on path $P$
was labeled $+1$ or $-1$, and $y_P$ is the label of the leaf at the end of the path $P$.

The only non-zero coefficients in $g$ are of the form $\prod_{i \in T} x_i$ for some $T \subseteq P$ for some path $P$. This
also means that the only non-zero Fourier coefficients can be those corresponding to such subsets.
This is because $\mathbb{E}_{x \sim \mu}[\chi^\mu_T(x) \prod_{i \in S} x_i] = 0$, unless $T \subseteq S$ (because $\mu$ is a product distribution). Since
the number of paths in $g$ is at most $t$ and the length of each path is at most $d$, we get the required
result. □

Lemma 3.10. Let $f$ be a t-leaf decision tree, let $g$ be a truncation of $f$ to depth
$\log(4t/\tau)/\log(1/(1 - c))$. Then,

$$\sum_{S \notin \mathcal{S}_g} \hat{f}^\mu(S)^2 \leq \tau.$$

Proof. Let $g$ be a truncation of $f$ at depth $\log(4t/\tau)/\log(1/(1 - c))$. Let $\mathcal{S}_g$ denote the set of
non-zero Fourier coefficients of $g$ under distribution $\mu$. Using Lemma 3.8, we know that
$\Pr_{x \sim \mu}[f(x) \neq g(x)] \leq \tau/4$, hence $\mathbb{E}[(f(x) - g(x))^2] \leq \tau$. Now, by Parseval's identity:

$$\tau \geq \mathbb{E}_{x \sim \mu}[(f(x) - g(x))^2] = \sum_{S \in \mathcal{S}_g} (\hat{f}^\mu(S) - \hat{g}^\mu(S))^2 + \sum_{S \notin \mathcal{S}_g} \hat{f}^\mu(S)^2 \geq \sum_{S \notin \mathcal{S}_g} \hat{f}^\mu(S)^2.$$

The proof is complete by observing that every coefficient $S \in \mathcal{S}_g$ satisfies
$|S| \leq \log(4t/\tau)/\log(1/(1 - c))$ by Lemma 3.9. □

$L_2$ Test: As in the case of the uniform distribution, we write $f(x)$ as:

$$f(x) = f_{-S}(x) + \chi^\mu_S(x) f_S(x),$$

where $f_{-S}(x) = \sum_{T : S \not\subseteq T} \hat{f}^\mu(T)\chi^\mu_T(x)$ and $f_S(x) = \sum_{T \supseteq S} \hat{f}^\mu(T)\chi^\mu_{T \setminus S}(x)$. Then, as in the case of the
uniform distribution, $f_S(x) = f_S(x_{-S}) = \mathbb{E}_{x_S \sim \mu_S}[f(x)\chi^\mu_S(x)]$, where now $x_S$ is drawn from the
restriction $\mu_S$ of the product distribution to the bits $x_S$. Note that for any given point $x$, $f_S(x)$
can be computed easily using $2^{|S|}$ membership queries that are $|S|$-local (since only the bits $x_S$
need to be changed). We point out that there is a subtle point in the case of product distributions.
Recall that $f_S(x) = \mathbb{E}_{x_S \sim \mu_S}[f(x)\chi^\mu_S(x)]$. In the case when $\mu$ is the uniform distribution, the
parity functions $\chi_S$ are $\{-1, 1\}$-valued, and so $f_S(x) \in [-1, 1]$. Thus, application of
Chernoff-Hoeffding bounds is straightforward. In the case of product distributions, the range of $\chi^\mu_S(x)$ can
be $\big[ -\prod_{i \in S} (1 + |\mu_i|)/\sqrt{1 - \mu_i^2},\; \prod_{i \in S} (1 + |\mu_i|)/\sqrt{1 - \mu_i^2} \big]$. Since we never consider sets $S$ that
are larger than $O(\log(n/\epsilon))$, the range of $f_S$ in our case is still polynomially bounded, and arbitrarily
good (inverse polynomial) estimates of the true expectation $\mathbb{E}_{x \sim \mu}[f_S(x)^2]$ can be obtained by
taking a sample and applying Chernoff-Hoeffding bounds. Thus, to simplify the presentation, we
assume we can compute the expectation (in Line 2.a.ii in Fig. 3) and the Fourier coefficients exactly.

Theorem 3.11 is the statement of the formal result about learning decision trees under product 
distributions. The main ideas are similar to the proof in the case of uniform distribution; but, the 
proof is more involved as explained above. 

Theorem 3.11. The algorithm in Fig. 3 with parameters $\theta = \sqrt{\epsilon / \big(2t\,(8t/\epsilon)^{1/\log(1/(1-c))}\big)}$ and
$d = \log(8t/\epsilon)/\log(1/(1 - c))$ outputs a hypothesis $\mathrm{sign}(h(x))$, such that $\mathrm{err}_\mu(\mathrm{sign}(h(x)), f) \leq \epsilon$.
The running time of the algorithm is polynomial in $n$, $t$ and $1/\epsilon$ and the algorithm makes only
$O(\log(nt/\epsilon))$-local membership queries to the oracle $\mathrm{MQ}(f)$.

The rest of this section is devoted to the proof of Theorem 3.11. 

Claim 3.12. If $S$ is such that $|\hat{f}^\mu(S)| > \theta$ and $|S| \leq d$, then $S \in \mathcal{S}$.

Proof. This proof is the same as the proof of Claim 3.5. □

Claim 3.13. If $S \in \mathcal{S}$, then there exists $S' \supseteq S$, such that $\hat{f}^\mu(S')^2 \geq (\theta^2/2)/\big(t \cdot (8t/\theta^2)^{1/\log(1/(1-c))}\big)$
and $|S'| \leq \log(8t/\theta^2)/\log(1/(1 - c))$.

Proof. Let $\tau = \theta^2/2$ and let $g'$ be the decision tree obtained by truncation of $f$ as described in
Lemma 3.10. Then, by Lemma 3.9, we know the depth of $g'$ is $\log(8t/\theta^2)/\log(1/(1 - c))$ and that
$\mathcal{S}_{g'}$ is of size at most $t \cdot 2^{\log(8t/\theta^2)/\log(1/(1-c))} = t \cdot (8t/\theta^2)^{1/\log(1/(1-c))}$. Also, by Lemma 3.10 we
know that $\sum_{T \notin \mathcal{S}_{g'}} \hat{f}^\mu(T)^2 \leq \theta^2/2$, and hence if $S$ passes the $L_2$-test, i.e. $\sum_{T \supseteq S} \hat{f}^\mu(T)^2 > \theta^2$,
it must be that $\sum_{T \supseteq S,\, T \in \mathcal{S}_{g'}} \hat{f}^\mu(T)^2 \geq \theta^2/2$. Hence, there must be some set $S'$ of size at most
$\log(8t/\theta^2)/\log(1/(1 - c))$ for which $\hat{f}^\mu(S')^2 \geq (\theta^2/2)/\big(t \cdot (8t/\theta^2)^{1/\log(1/(1-c))}\big)$. □

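Proof of Theorem 3.11. Let $g$ be the truncation of the target decision tree, $f$, to depth $d$. Then
using Lemma 3.10, we know that $\sum_{S \notin \mathcal{S}_g} \hat{f}^\mu(S)^2 \leq \epsilon/2$. Now, every coefficient $S \in \mathcal{S}_g$ for which
$|\hat{f}^\mu(S)| > \theta$ is in $\mathcal{S}$ (see Algorithm 3 and Claim 3.12). Also, $|\mathcal{S}_g| \leq t 2^d$. Tedious calculations show that
$\sum_{S \in \mathcal{S}_g :\, |\hat{f}^\mu(S)| \leq \theta} \hat{f}^\mu(S)^2 \leq t 2^d \theta^2 \leq \epsilon/2$. Thus, $\sum_{S \in \mathcal{S}} \hat{f}^\mu(S)^2 \geq \sum_{S \in \mathcal{S} \cap \mathcal{S}_g} \hat{f}^\mu(S)^2 \geq 1 - \epsilon$. This implies
by Parseval, that $\mathbb{E}_{x \sim \mu}[(h(x) - f(x))^2] \leq \epsilon$, where $h(x)$ is as defined in Algorithm 3.

The only thing remaining to show is that $|\mathcal{S}|$ always remains bounded by $\mathrm{poly}(t, n, 1/\epsilon)$. This
can be shown easily using Claim 3.13, since if $S \in \mathcal{S}$, there exists $S' \supseteq S$, such that $|S'| \leq
\log(8t/\theta^2)/\log(1/(1 - c))$ and $\hat{f}^\mu(S')^2 \geq (\theta^2/2)/\big(t \cdot (8t/\theta^2)^{1/\log(1/(1-c))}\big)$. Thus, the magnitude of
$\hat{f}^\mu(S')^2$ is at least $1/\mathrm{poly}(t, n, 1/\epsilon)$, so by Parseval there can be at most $\mathrm{poly}(t, n, 1/\epsilon)$ such sets $S'$. Also, the
size $|S'|$ is $O(\log(tn/\epsilon))$, so each such $S'$ has at most $\mathrm{poly}(t, n, 1/\epsilon)$ subsets; thus the total number of
subsets added to $\mathcal{S}$ is at most $\mathrm{poly}(t, n, 1/\epsilon)$. □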

3.4 Learning under Random Classification Noise 

In this section, we show how the algorithms for learning decision trees can be implemented even
with access to a noisy oracle. The learning algorithm we use is allowed queries to the membership
oracle, $\mathrm{MQ}(f)$, therefore we consider a persistent random noise model. An easy way to conceptualize
this model is as follows: Let $\zeta : \{-1, 1\}^n \to \{-1, 1\}$ be a function where for each $x \in \{-1, 1\}^n$,
the value of $\zeta(x) = 1$ with probability $1 - \eta$ and $-1$ with probability $\eta$, independently. Once this
noise function, $\zeta$, has been fixed, we assume that we have access to the function $f^\eta = f \cdot \zeta$ rather
than the function $f$. We show how the tests mentioned in this section can be implemented using
$\mathrm{EX}(f^\eta, D)$ and $\mathrm{MQ}(f^\eta)$, rather than $\mathrm{EX}(f, D)$ and $\mathrm{MQ}(f)$.

3.4.1 Non-Zero Test 

Recall that we are interested in estimating $\Pr[f_S(x) \neq 0]$, where $S \subseteq [n]$, and

$$f_S(x) = \mathbb{E}_{x_S \sim U_S}[f(x)\chi_S(x)] \qquad (2)$$

Instead, if we have access to $f^\eta$, we are able to compute,

$$f^\eta_S(x) = \mathbb{E}_{x_S \sim U_S}[f^\eta(x)\chi_S(x)]$$

Although the random classification noise is persistent and fixed according to $\zeta$, for the purpose of
analysis it is easier to imagine that for each $x$, $\zeta(x)$ is only determined when the algorithm makes
a query for the point $x$ (or $x$ is drawn by $\mathrm{EX}(f^\eta, D)$). Lemma 3.14 allows us to conclude that the
test required in Section 3.1 can be performed using access to $f^\eta$ instead of $f$. The lemma assumes
that $\zeta(x)$ is chosen independently each time $x$ is queried, i.e. the noise is not persistent. However,
we show later that our algorithm queries each example only once, so the noise may as well have
been persistent.

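Lemma 3.14. The following are true:

1. $\Pr_{x,\zeta}[f^\eta_S(x) \neq 0] \geq (1 - p_0) + \dfrac{c_0 (2\eta - 1)^2}{2^{3|S|/2}} \Pr_x[f_S(x) \neq 0]$

2. $\Pr_{x,\zeta}[f^\eta_S(x) \neq 0] \leq (1 - p_0) + \Pr_x[f_S(x) \neq 0]$

Here, $c_0$ is an absolute constant, and $p_0$ depends only on $|S|$ and $\eta$. The probability is taken over the
choice of $x \sim D$ and the choice of $\zeta$.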

Proof. Note that $f_S(x) = \mathbb{E}_{x_S \sim U_S}[f(x)\chi_S(x)]$, and so $f_S(x)$ is evaluated by using $2^{|S|}$ different
values of $f(x)$. For every $x$, $f(x) \in \{-1, 1\}$, and hence if $f_S(x) = 0$, it must be that the $2^{|S|}$ values
used in the expectation have exactly $2^{|S|-1}$ $+1$s and $-1$s each.
On the other hand, if $f_S(x) \neq 0$, then the number of $+1$s is different from the number of $-1$s. If $f_S(x) \neq 0$,
without loss of generality, we only consider the case when $f_S(x) > 0$, so that there are more $+1$s
than $-1$s. Thus, we are left with the following combinatorial question:

Suppose we begin with $2k$ variables, $x_1, \ldots, x_{2k}$, where each $x_i$ is $+1$ or $-1$. Let $k_1$ be the
number of $+1$s, so that $2k - k_1$ is the number of $-1$s. We will assume throughout that $k \geq 2$. We
perform the following process: each $x_i$ is left as is with probability $1 - \eta$ and its sign flipped with
probability $\eta$, independently. Let $x'_i$ be the values of the resulting variables, and let $X' = \sum_i x'_i$.
Let $p^k_i$ denote the probability that $X'$ is 0, having started with $(k + i)$ $+1$s and $(k - i)$ $-1$s. Thus,
$p^k_0$ is the probability of getting a 0 when we start with an equal number of $+1$s and $-1$s.

Then the following are true: 

1. $p^k_i$ decreases as $i$ increases.

2. $p^k_0 - p^k_1 \geq (2\eta - 1)^2 c_0/k^{3/2}$ for some absolute constant $c_0$.

The proof of the above facts is provided in Appendix A, though it should be fairly clear that the
conclusions make sense. When $\eta = 1/2$, the initial values of the $x_i$ are irrelevant and
each $x'_i = \pm 1$ with probability $1/2$, but for $\eta < 1/2$, if one started with the sum $\sum_i x_i = 0$, it is
more likely that $\sum_i x'_i = 0$ than if one started from some value $\sum_i x_i$ that was greater than 0.
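The probabilities in this combinatorial question have a simple closed form, which the following sketch evaluates. It is an illustrative calculation, not taken from the paper's appendix; the function name `zero_sum_probability` is hypothetical. The sum is zero after the flips exactly when the number $a$ of flipped $+1$s and the number $b$ of flipped $-1$s satisfy $a - b = i$.

```python
from math import comb

def zero_sum_probability(k, i, eta):
    """Probability that 2k signs sum to 0 after independent flips with rate eta,
    starting from k+i values equal to +1 and k-i values equal to -1."""
    total = 0.0
    for b in range(k - i + 1):        # b = number of -1s flipped to +1
        a = b + i                      # a = number of +1s flipped to -1
        total += (comb(k + i, a) * comb(k - i, b)
                  * eta ** (a + b) * (1 - eta) ** (2 * k - a - b))
    return total
```

For $k = 2$ and $\eta = 0.1$ this gives $0.6886$, $0.2214$ and $0.0486$ for $i = 0, 1, 2$, in line with the first fact, while for $\eta = 1/2$ every starting configuration gives $\binom{4}{2}/2^4 = 0.375$.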

We apply the above to the setting when $k = 2^{|S|-1}$. We drop the superscripts in $p_0^{2^{|S|-1}}$ and $p_1^{2^{|S|-1}}$
in the rest of this discussion. First, imagine that we have fixed the variables $x_{-S}$ so that the
expectation (2) is only a function of the noise function $\zeta$. If $f_S(x_{-S}) = 0$, then $\Pr_\zeta[f^\eta_S(x_{-S}) = 0] =
p_0$. On the other hand, if $f_S(x_{-S}) \neq 0$, then $0 \leq \Pr_\zeta[f^\eta_S(x_{-S}) = 0] \leq p_1$. So, we have the following:

$$\Pr_{x,\zeta}[f^\eta_S(x) \neq 0] \geq \Pr_x[f_S(x) \neq 0](1 - p_1) + \Pr_x[f_S(x) = 0](1 - p_0) = (1 - p_0) + (p_0 - p_1)\Pr_x[f_S(x) \neq 0].$$

On the other hand,

$$\Pr_{x,\zeta}[f^\eta_S(x) \neq 0] \leq \Pr_x[f_S(x) \neq 0] + (1 - p_0)\Pr_x[f_S(x) = 0] \leq (1 - p_0) + \Pr_x[f_S(x) \neq 0].$$

This completes the proof of the assertion. □

We note that this allows us to distinguish between the case where $\Pr_{x \sim D}[f_S(x) \neq 0] \geq \alpha$
and the case where $\Pr_{x \sim D}[f_S(x) \neq 0] \leq \beta$, as long as $\alpha - \beta$ is sufficiently large. This can be done by choosing
$\beta = \alpha \cdot (2\eta - 1)^2 c_0/(2 \cdot 2^{3|S|/2})$, and then computing the value $\Pr_{x \sim D}[f^\eta_S(x) \neq 0]$. Note that $p_0$
can be computed exactly, if the size $|S|$ and the noise rate $\eta$ are known. We assume that the noise
rate is known; if not, the standard trick of binary searching for the noise rate can be employed. Note
that these tests can be carried out to high accuracy from samples. Now, in the case when $D$ is
an $\alpha$-smooth distribution for constant $\alpha$, any two points $x$ and $x'$ drawn from $\mathrm{EX}(f, D)$ will have
Hamming distance $\Omega(n)$ with very high probability (see Fact 2.2). The local queries to $\mathrm{MQ}(f^\eta)$ are only made for
points that are at Hamming distance $O(\log(n))$ from sampled points. Thus, with very
high probability, the queries made to compute $f^\eta_S(x)$ and $f^\eta_S(x')$ do not have any point in common,
i.e. no example is queried twice by the learning algorithm. So we can employ Lemma 3.14 as if the
noise was chosen independently each time a point was queried.
3.4.2 $L_2$ Test

Recall that $f^\eta_S(x_{-S}) = \mathbb{E}_{x_S \in \{-1,1\}^{|S|}}[f^\eta(x_S x_{-S})\chi_S(x_S)]$. For a fixed $x_{-S}$, $f^\eta_S(x_{-S})$ is a random variable
depending only on the noise function $\zeta$. Let $2^{|S|} f_S(x) = 2k$, where $2k$ is some even integer in the
range $[-2^{|S|}, 2^{|S|}]$. Let $k_1 = 2^{|S|-1} + k = 2^{|S|-1}(1 + f_S(x))$ and $k_2 = 2^{|S|-1} - k = 2^{|S|-1}(1 - f_S(x))$, so
that $2^{|S|} f_S(x)$ is a sum of $k_1$ $+1$s and $k_2$ $-1$s. Let $Z_1 \sim \mathrm{Bin}(k_1, \eta)$ and $Z_2 \sim \mathrm{Bin}(k_2, \eta)$ be binomial
random variables. Then $2^{|S|} f^\eta_S(x_{-S}) = 2^{|S|} f_S(x) - 2Z_1 + 2Z_2$. This follows immediately from the
definition of the noise model. The following can then be verified by straightforward calculations,

$$\mathbb{E}_\zeta[f^\eta_S(x_{-S})] = (1 - 2\eta) f_S(x)$$
$$\mathbb{E}_\zeta[f^\eta_S(x_{-S})^2] = (1 - 2\eta)^2 f_S(x)^2 + 2^{-|S|+2}\eta(1 - \eta)$$

Thus, if we can obtain accurate estimates of $\mathbb{E}_{x \sim D}[f^\eta_S(x)^2]$, we can also obtain accurate estimates of
$\mathbb{E}_{x \sim D}[f_S(x)^2]$. Again, as in the previous case, we observe that the algorithm (with high probability)
never makes a query twice for the same example. Thus, we can assume that the noise model is in
fact not persistent. It is clear that $\mathbb{E}_{x \sim D}[f^\eta_S(x)^2]$ can be estimated highly accurately by sampling.

4 Learning Multilinear Polynomials under Smooth Distributions 

In this section, we consider the problem of learning t-sparse polynomials with coefficients over $\mathbb{R}$
(or $\mathbb{Q}$), when the domain is restricted to $\{0, 1\}^n$. In this case, we may as well assume that the
polynomials are multi-linear. We assume that the absolute values of the coefficients are bounded
by $B$, and hence the polynomials take values in $[-tB, tB]$ on the domain $\{0, 1\}^n$. For a subset
$S \subseteq [n]$, let $\xi_S(x) = \prod_{i \in S} x_i$; thus $\xi_S(x)$ is the monomial corresponding to the variables in the set
$S$. Note that any t-sparse multi-linear polynomial can be represented as,

$$f(x) = \sum_{S \subseteq [n]} c_S\, \xi_S(x),$$

where $c_S \in \mathbb{R}$, $|\{S \mid c_S \neq 0\}| \leq t$, and $|c_S| \leq B$ for all $S$. Let $\mathbb{R}^n_{t,B}[X]$ denote the class of multi-linear
polynomials over $n$ variables with coefficients in $\mathbb{R}$, where at most $t$ coefficients are non-zero and
all coefficients have magnitude at most $B$.

We assume that we have an infinite precision computation model for reals.$^6$ Also, since the
polynomials may take on arbitrary real values, we use squared loss as the notion of error. For a
distribution $D$ over $\{0, 1\}^n$, the squared loss between polynomials $f$ and $h$ is $\mathbb{E}_{x \sim D}[(f(x) - h(x))^2]$.

Our main result is: 

Theorem 4.1. The class $\mathbb{R}^n_{t,B}[X]$ is learnable with respect to the class of $\alpha$-smooth distributions over
$\{0, 1\}^n$, using $O(\log(n/\epsilon) + \log(t/\epsilon))$-local MQs and in time that is polynomial in $(ntB/\epsilon)^\alpha$. The
output hypothesis is a multi-linear polynomial, $h$, such that $\mathbb{E}_{x \sim D}[(h(x) - f(x))^2] \leq \epsilon$.

Recall that for a subset $S$, $x_S$ denotes the variables that are in $S$, and that $-S$ denotes the set
$[n] \setminus S$. Let $f_S(x_{-S})$ denote the multi-linear polynomial defined only on the variables in $x_{-S}$,

$$f_S(x_{-S}) = \sum_{T \subseteq -S} c_{S \cup T}\, \xi_T(x_{-S}).$$

We first describe the high-level idea of the proof of Theorem 4.1. Algorithm 4 outputs a 
hypothesis that approximates the polynomial /. 
Algorithm: LEARNING t-SPARSE POLYNOMIALS
inputs: $d$, $\theta$, oracles $\mathrm{EX}(f, D)$, (local) $\mathrm{MQ}(f)$

1. let $\mathcal{S} = \{\emptyset\}$

2. repeat (while some new set is added to $\mathcal{S}$)

   (a) for every $S' \in \mathcal{S}$ with $|S'| \leq d - 1$ and for every $j \in -S'$

       i. let $S = S' \cup \{j\}$

       ii. if $\Pr_{D_{-S}}[f_S(x_{-S}) \neq 0] > \theta$, then $\mathcal{S} = \mathcal{S} \cup \{S\}$

3. Perform regression to identify a polynomial $h = \sum_S h[S]\,\xi_S(x)$ that minimizes
   $\mathbb{E}[(f(x) - h(x))^2]$, subject to:

   (a) $h[S] = 0$ for $S \notin \mathcal{S}$.

   (b) $\sum_S |h[S]| \leq tB$

output: $h(x)$

Figure 4: Algorithm: Learning t-Sparse Polynomials

Truncation: First, we show that there are low-degree polynomials that approximate the
multi-linear polynomial $f$ up to arbitrary (inverse polynomial) accuracy. These polynomials are the
truncations of $f$ itself. Let $f^d$ denote the multi-linear polynomial obtained from $f$ by discarding
all the terms of degree at least $d + 1$. Note that $f^d$ is multi-linear and t-sparse, and has coefficients
of magnitude at most $B$. Thus,

$$f^d(x) = \sum_{S \subseteq [n],\; |S| \leq d} c_S\, \xi_S(x).$$

Now, observe that because $D$ is $\alpha$-smooth, the probability that $\xi_S(x) = 1$ is at most $(\alpha/(1 + \alpha))^{|S|}$
(see Fact 2.2). Thus, the probability that at least one term of degree $\geq d + 1$ in $f$ is non-zero
is at most $t(\alpha/(1 + \alpha))^d$ by a union bound. Thus, $\Pr_{x \sim D}[f(x) \neq f^d(x)] \leq t(\alpha/(1 + \alpha))^d$. Also,
since $|f(x)| \leq tB$ and $|f^d(x)| \leq tB$, this implies that $\mathbb{E}_{x \sim D}[(f(x) - f^d(x))^2] \leq 4t^3B^2(\alpha/(1 + \alpha))^d$.
By choosing $d$ appropriately, when $\alpha$ is a constant this quantity can be made arbitrarily (inverse
polynomially) small.

Step 2 of the Algorithm (see Fig. 4) identifies all the important coefficients of the polynomial
$f$. Suppose we could guarantee that the set $\mathcal{S}$ contains all coefficients $S$ such that $c_S \neq 0$ and
$|S| \leq d$, i.e. all non-zero coefficients of $f^d$ are identified. This guarantees that the regression in step
3 will give a good approximation to $f$, since the error of the hypothesis obtained by regression has
to be smaller than $\mathbb{E}_{x \sim D}[(f(x) - f^d(x))^2]$. (The generalization guarantees are fairly standard and
are described later.)

Identifying Important Monomials: In order to test whether or not a monomial, $S$, is important,
the algorithm checks whether $\Pr_{D_{-S}}[f_S(x_{-S}) \neq 0] > \theta$. Here, $D_{-S}$ is the marginal distribution over
the variables $x_{-S}$. We assume that this test can be performed perfectly accurately. (The analysis
using samples is standard by applying appropriate Chernoff-Hoeffding bounds.)

$^6$ The case when we have bounded precision can be handled easily since our algorithms run in time polynomial in
$B$, but is more cumbersome.
In Lemma 4.4, we show that if the polynomial $f_S(x_{-S})$ has a non-zero coefficient of degree at
most $d - |S|$, then the probability that $f_S(x_{-S}) \neq 0$ is at least $(1/(1 + \alpha))^{d + \log(t)}$. In Lemma 4.5,
we show that if $f_S(x_{-S})$ has no non-zero coefficient of degree less than $d' = O((d + \ln(t))\ln(1 + \alpha))$,
then $f_S(x_{-S}) \neq 0$ with probability at most $0.5\,(1/(1 + \alpha))^{d + \log(t)}$. Thus, we will never add any
subset $S$, unless there is some coefficient $T$ in $f$ of size at most $d' = O((d + \ln(t))\ln(1 + \alpha))$ and
$S \subseteq T$. However, the number of such $T$ is at most $t$, and each such set can have at most $2^{d'}$ subsets.
This bounds the total number of subsets the algorithm may add to $\mathcal{S}$, and hence also the running
time of the algorithm (to polynomial in the required parameters).

Note that sampling $x_{-S}$ according to $D_{-S}$ is trivial: just draw a random example from $\mathrm{EX}(f, D)$
and ignore the variables $x_S$. Let $U_S$ denote the uniform distribution over the variables in $x_S$. Then we
have,

$$f_S(x_{-S}) = \mathbb{E}_{x_S \sim U_S}\Big[ 2^{|S|} \prod_{i \in S} (2x_i - 1)\, f(x) \Big] \qquad (3)$$

The variables in $x_{-S}$ are fixed; the expectation is only taken over the uniform distribution over the
variables in $x_S$. Notice that for any $i \in S$, since $x_i \in \{0, 1\}$, $\mathbb{E}_{x_S}[(2x_i - 1)] = 0$ and
$\mathbb{E}_{x_S}[(2x_i - 1)x_i] = 1/2$. Thus, in the RHS of (3), if $S \not\subseteq T$, $\mathbb{E}_{x_S}[2^{|S|} \prod_{i \in S}(2x_i - 1)\,\xi_T(x)] = 0$, and if $S \subseteq T$,
$\mathbb{E}_{x_S}[2^{|S|} \prod_{i \in S}(2x_i - 1)\,\xi_T(x)] = \xi_{T \setminus S}(x_{-S})$. Thus, the relation in (3) is true. Also, this means that
if the example $x$ is received by querying the oracle $\mathrm{EX}(f, D)$, $f_S(x_{-S})$ can be obtained by making
$O(|S|)$-local membership queries to the oracle $\mathrm{MQ}(f)$.
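Relation (3) is easy to check numerically. The sketch below is an illustration with hypothetical names: it represents a sparse multilinear polynomial as a dictionary from monomials (frozensets of indices) to coefficients, and compares the query-based formula with the coefficient-based definition $f_S(x_{-S}) = \sum_{T \subseteq -S} c_{S \cup T}\,\xi_T(x_{-S})$. The factor $2^{|S|}$ cancels against the uniform average over the $2^{|S|}$ settings of $x_S$, leaving a plain signed sum of function values.

```python
from itertools import product

def evaluate(poly, x):
    """poly maps frozenset(T) -> coefficient c_T; x is a tuple over {0, 1}."""
    return sum(c for T, c in poly.items() if all(x[i] == 1 for i in T))

def f_S_by_queries(poly, S, x):
    """Right-hand side of (3): signed sum of f over all settings of the bits in S."""
    S = sorted(S)
    total = 0.0
    for bits in product((0, 1), repeat=len(S)):
        y = list(x)
        weight = 1
        for i, b in zip(S, bits):
            y[i] = b
            weight *= 2 * b - 1        # the factor (2 x_i - 1)
        total += weight * evaluate(poly, tuple(y))
    return total

def f_S_by_definition(poly, S, x):
    """Sum of c_T * xi_{T \\ S}(x) over monomials T that contain S."""
    S = frozenset(S)
    return sum(c for T, c in poly.items()
               if S <= T and all(x[i] == 1 for i in T - S))
```

Every point queried by `f_S_by_queries` differs from `x` only on the coordinates in `S`, so the queries are $|S|$-local with respect to the example.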

We now give formal proofs of the ideas explained above. 

4.1 Truncation 

We show that truncation (to log-degree) does not change the polynomial significantly, under smooth 
distributions. 

Lemma 4.2. Let $f$ be a t-sparse multi-linear polynomial and let $D$ be an $\alpha$-smooth distribution.
Let $f^d$ be the polynomial $f$ where all terms of degree greater than $d$ are set to 0. Then,

$$\Pr_D[f^d(x) \neq f(x)] \leq t \Big( \frac{\alpha}{1 + \alpha} \Big)^d,$$

and hence,

$$\mathbb{E}_D[(f(x) - f^d(x))^2] \leq 4t^3B^2 \Big( \frac{\alpha}{1 + \alpha} \Big)^d.$$

Proof. Note that for any $x$, if $f(x) \neq f^d(x)$, there must be a term of $f$ of degree at least $d$ that is
not 0. For any fixed monomial of $f$ of degree at least $d$, the probability that it is non-zero for a
random point $x$ drawn under an $\alpha$-smooth $D$ is at most $(\alpha/(1 + \alpha))^d$ (see Fact 2.2). Taking a
union bound over the $t$ possible terms gives the first bound. The second follows since
$|f(x) - f^d(x)| \leq 2tB$ for every $x$. □

4.2 Identifying Important Monomials 

First, we have the following useful general lemma. 

Lemma 4.3. Let $f$ be a t-sparse multi-linear polynomial defined over any field, $\mathbb{F}$, with a non-zero
constant term, $c_0$. Let $D$ be any $\alpha$-smooth distribution over $\{0, 1\}^n$, then

$$\Pr_D[f(x) \neq 0] \geq \Big( \frac{1}{1 + \alpha} \Big)^{\log(t)}.$$






Proof. We prove this by induction on the number of variables, n. When n = 1, the only possible 
polynomials are f{x\) = cq + c\x\. Then f(x) = if and only if x\ = 1 and c\ = — Co (since cq / 0). 
Note that when D is a-smooth, Pr[x x = 1] < a/(l + a) (see Fact 2.2). Thus, Pr£,[/(a;i) ^ 0] > 
1/(1 + a). (And the sparsity is 2, and log(2) = 1.) Thus the base case is verified. 

Let $f$ be any multi-linear polynomial defined over $n$ variables. Suppose there exists a variable,
without loss of generality say $x_1$, such that $c_1 x_1$ is a term in $f$, where $c_1 \neq 0$. Then we can write
$f$ as follows:

$$f(x) = f_{-1}(x) + x_1 f_1(x)$$

where $f_{-1}$ and $f_1$ are both multi-linear polynomials over $n-1$ variables and both have a non-zero
constant term. (The constant term of $f_{-1}$ is just $c_0$, and $f_1$ has constant term $c_1$.) Then note that
$1/(1+\alpha) \le \Pr_D[x_1 = b \mid x_{-1}] \le \alpha/(1+\alpha)$, for both $b = 1$ and $b = 0$. Now, it is easy to see that
$\Pr_D[f(x) \neq 0] \ge \Pr_D[x_1 = 0 \mid x_{-1}] \Pr_D[f_{-1}(x) \neq 0] \ge (1/(1+\alpha)) \Pr_D[f_{-1}(x) \neq 0]$.

To see that $\Pr_D[f(x) \neq 0] \ge (1/(1+\alpha)) \Pr_D[f_1(x) \neq 0]$ consider the following: fix $x_{-1}$; if
$f_1(x) \neq 0$, then for at least one setting of $x_1$, it must be the case that $f(x) \neq 0$. Thus,
conditioned on $x_{-1}$, $\Pr_D[f(x) \neq 0 \mid x_{-1}] \ge (1/(1+\alpha)) \delta(f_1(x) \neq 0)$ (here $\delta(\cdot)$ is the indicator
function). Thus, $\Pr_D[f(x) \neq 0] \ge (1/(1+\alpha)) \Pr_D[f_1(x) \neq 0]$. However, at least one of $f_{-1}$ and $f_1$
must have sparsity at most $t/2$, thus by induction we are done.

In the case that there is no $x_i$ such that $c_i x_i$ (with $c_i \neq 0$) appears in $f$ as a term, let $f_0$ be the
polynomial obtained from $f$ by setting $x_1 = 0$ and $f_1$ be the polynomial obtained from $f$ by setting
$x_1 = 1$. Note that both $f_0$ and $f_1$ have constant term $c_0 \neq 0$ and sparsity at most $t$, but they have
one fewer variable than $f$. Thus, $\Pr_{D_b}[f_b(x) \neq 0] \ge (1/(1+\alpha))^{\log(t)}$ for $b = 0, 1$, where $D_b$
denotes $D$ conditioned on $x_1 = b$. However, note that

$$\Pr_D[f(x) \neq 0] = \Pr_D[x_1 = 0] \Pr_{D_0}[f_0(x) \neq 0] + \Pr_D[x_1 = 1] \Pr_{D_1}[f_1(x) \neq 0] \ge \left(\frac{1}{1+\alpha}\right)^{\log(t)}.$$

This completes the induction. □ 
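
As a sanity check, the bound of Lemma 4.3 can be verified exhaustively on small random polynomials. This is a sketch under stated assumptions: the uniform distribution stands in for the $\alpha = 1$ case, $\log$ is read as base 2 (as the base case $\log(2) = 1$ suggests), integer coefficients are used for exact arithmetic, and the function names are hypothetical.

```python
import itertools
import math
import random

def random_poly_with_constant(n, t, rng):
    """Random t-sparse multilinear polynomial with a non-zero constant term."""
    poly = {frozenset(): rng.choice([-2, -1, 1, 2])}      # constant term c_0 != 0
    while len(poly) < t:
        size = rng.randint(1, n)
        poly[frozenset(rng.sample(range(n), size))] = rng.choice([-2, -1, 1, 2])
    return poly

def nonzero_probability(poly, n):
    """Exact Pr[f(x) != 0] under the uniform distribution on {0,1}^n."""
    count = 0
    for x in itertools.product((0, 1), repeat=n):
        value = sum(c for mono, c in poly.items() if all(x[i] for i in mono))
        count += value != 0
    return count / 2 ** n

rng = random.Random(1)
n = 8
checks = []   # (sparsity t, exact probability, lower bound (1/2)^{log2 t} = 1/t)
for t in (1, 2, 4, 8, 16):
    for _ in range(20):
        f = random_poly_with_constant(n, t, rng)
        checks.append((t, nonzero_probability(f, n), 0.5 ** math.log2(t)))
```

For $\alpha = 1$ the lower bound is simply $1/t$, so the check says that a $t$-sparse polynomial with non-zero constant term is non-zero on at least a $1/t$ fraction of the cube.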

Using Lemma 4.3, we can now show that step 3 in the algorithm correctly identifies all the 
important monomials (monomials of low-degree with non-zero coefficients in /). 

Lemma 4.4. Suppose $S \subseteq [n]$, such that $f_S(x)$ has a monomial of degree at most $d - |S|$ with
non-zero coefficient. Then,

$$\Pr_{D_{-S}}[f_S(x_{-S}) \neq 0] \ge \left(\frac{1}{1+\alpha}\right)^{d - |S| + \log(t)}.$$



Proof. Note that, since $D$ is an $\alpha$-smooth distribution, $D_{-S}$ is also an $\alpha$-smooth distribution (see
Fact 2.2). Let $S'$ be a subset of $-S$, such that $\xi_{S'}(x_{-S})$ is the smallest degree monomial in $f_S(x_{-S})$
with non-zero coefficient. Then, since $D_{-S}$ is $\alpha$-smooth, $\Pr_{D_{-S}}[\xi_{S'}(x_{-S}) = 1] \ge (1/(1+\alpha))^{|S'|} \ge
(1/(1+\alpha))^{d-|S|}$.

Now, the conditional distribution $(D_{-S} \mid \xi_{S'}(x_{-S}) = 1)$ is not $\alpha$-smooth, but the marginal
distribution with respect to variables $-(S \cup S')$, $(D_{-S} \mid \xi_{S'}(x_{-S}) = 1)_{-(S \cup S')}$, is indeed $\alpha$-smooth
(see Fact 2.2). Let $f_S^{S'}(x_{-(S \cup S')})$ be the polynomial obtained from $f_S$ by setting $x_i = 1$ for each
$i \in S'$. Note that the constant term of $f_S^{S'}$ is non-zero and it is $t$-sparse, and is only defined on the
variables in $-(S \cup S')$. Hence, by applying Lemma 4.3 to $f_S^{S'}$ and the marginal (w.r.t. the variables
$-S'$) of the conditional distribution $(D_{-S} \mid \xi_{S'}(x_{-S}) = 1)$, $(D_{-S} \mid \xi_{S'}(x) = 1)_{-(S \cup S')}$, we get the
required result. □

Next, we show the following simple lemma that will allow us to conclude that step 3 of the
algorithm never adds too many terms.

Lemma 4.5. If each term of $f_S$ has degree at least $d'$, then the probability that $f_S(x) \neq 0$ is at
most $t(\alpha/(1+\alpha))^{d'}$.

Proof. Note that each monomial of $f_S$ has degree at least $d'$. Under any $\alpha$-smooth distribution, the
probability that a monomial of degree $d'$ is non-zero is at most $(\alpha/(1+\alpha))^{d'}$ (see Fact 2.2). Since
$D_{-S}$ is an $\alpha$-smooth distribution, by a simple union bound we get the required result. □

Now in order to get an $\epsilon$-approximation in terms of squared error, using Lemma 4.2, it is clear
that it suffices to choose $d = \log(4t^3B^2/\epsilon)/\log((1+\alpha)/\alpha)$, and consider the truncation $f_d$. For this
value of $d$, if $\theta$ is set to $1/(4t^3B^2)^{2\log(1+\alpha)/\log((1+\alpha)/\alpha)}$, using Lemma 4.4, we are sure that all the
monomials in $f$ of degree at most $d$ that have non-zero coefficients are identified in step 3 of the
algorithm. Note that $\theta$ is still inverse polynomial in $(ntB/\epsilon)^\alpha$.

Finally, we note that if $d'$ is set to $\log(2t/\theta)/\log((1+\alpha)/\alpha)$, then for any subset, $S$, if the
monomial with the least degree in $f_S$ has degree at least $d'$, then $\Pr_{D_{-S}}[f_S(x) \neq 0] \le \theta/2$. In
particular, this means that if a set, $S$, with $|S| \le d$, is such that the smallest monomial, $\xi_T(x)$, in $f$
for which $S \subseteq T$ is such that $|T| \ge d + d'$, then $S$ will never be added to $\mathcal{S}$ by the algorithm. The
fact that this probability was $\theta/2$ (instead of exactly $\theta$) means that sampling can be used to carry out
the test in the algorithm to reasonable accuracy. Finally, observe that $t 2^{d+d'}$ is still polynomial in
$(ntB/\epsilon)^\alpha$. Thus, the total number of sets added to $\mathcal{S}$ can never be more than polynomially many.
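
The two degree thresholds above can be computed directly, which makes the role of each formula concrete. The sketch below is illustrative only: the numeric values of $t$, $B$, $\epsilon$, $\alpha$ and in particular $\theta$ are arbitrary assumptions (not the value of $\theta$ prescribed in the text), and the function names are hypothetical.

```python
import math

def truncation_degree(t, B, eps, alpha):
    """d = log(4 t^3 B^2 / eps) / log((1 + alpha) / alpha)."""
    return math.log(4 * t**3 * B**2 / eps) / math.log((1 + alpha) / alpha)

def slack_degree(t, theta, alpha):
    """d' = log(2 t / theta) / log((1 + alpha) / alpha)."""
    return math.log(2 * t / theta) / math.log((1 + alpha) / alpha)

# Illustrative values only; theta is treated as a free parameter here.
t, B, eps, alpha, theta = 10, 1.0, 0.1, 2.0, 1e-6

d = truncation_degree(t, B, eps, alpha)
d_prime = slack_degree(t, theta, alpha)
ratio = alpha / (1 + alpha)

truncation_error = 4 * t**3 * B**2 * ratio**d    # Lemma 4.2 bound evaluated at d
false_positive = t * ratio**d_prime              # Lemma 4.5 bound evaluated at d'
```

With these choices the Lemma 4.2 bound evaluates to $\epsilon$ and the Lemma 4.5 bound to $\theta/2$, which is exactly why the text picks these two expressions.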

Generalization. The generalization argument is fairly standard, so we just present an outline.

First, we observe that it is fine to discretize real numbers to some $\Delta$, where $\Delta$ is inverse polynomial
in $(ntB/\epsilon)^\alpha$, without blowing up the squared loss. Now, the regression in the algorithm requires
that the sum of absolute values of the coefficients of the polynomial, $h$, be at most $tB$. Thus,
we can view this as distributing $tB/\Delta$ blocks over $2^n$ possible coefficients (in fact the number of
coefficients is smaller). The total number of such discretized polynomials is at most $2^{\mathrm{poly}((ntB/\epsilon)^\alpha)}$.
Thus, it suffices to minimize the squared error on a (reasonably large) sample.

5 Separation Results 

In this section, we show that the PAC+$r$-local MQ model is strictly more powerful than the PAC model,
assuming that the class of polynomial-size circuits is not PAC-learnable. In the following discussion
we show that even 1-local membership queries are more powerful than the standard PAC setting.
We note that the class of $k$-juntas is known to be learnable in $\mathrm{poly}(n, 2^k)$ time
with 1-local membership queries.

In this section, we assume that we are working with the domain $\{0,1\}^n$, rather than $\{-1,1\}^n$.
Let $\mathcal{F}_n = \{f_s : \{0,1\}^n \to \{0,1\}\}_{s \in \{0,1\}^n}$ be a pseudo-random family of functions. It is well-known
that such families can be constructed under the assumption that one-way functions exist [GGM86].
Let $A_1, \ldots, A_n$ be a partition of $\{0,1\}^n$ that is easily computable. For example, if the strings
in $\{0,1\}^n$ are lexicographically ordered, then $A_i$ contains strings with rank in the range
$[(i-1)2^n/n, i 2^n/n)$. For an $(n+1)$-bit string $x$, $x_{-1}$ denotes the $n$-length suffix of $x$. Then, for some
string $s$, define the function $g_s : \{0,1\}^{n+1} \to \{0,1\}$ as follows:

$$g_s(x) = \begin{cases} f_s(x_{-1}) & \text{if } x_1 = 0 \\ f_s(x_{-1}) \oplus s_i & \text{if } x_1 = 1 \text{ and } x_{-1} \in A_i \end{cases}$$

Define $\mathcal{G}_{n+1} = \{g_s : \{0,1\}^{n+1} \to \{0,1\}\}_{s \in \{0,1\}^n}$. We show below that the class $\mathcal{G}_{n+1}$ is not learnable
in the PAC setting, but is learnable in the PAC+1-local MQ model under the uniform distribution.

Theorem 5.1. Assuming that one-way functions exist, the class $\mathcal{G}_{n+1}$ is not learnable in the PAC
model, but is learnable in the PAC+1-local MQ model, under the uniform distribution.

Proof. First, we show that $\mathcal{G}_{n+1}$ is learnable in the PAC+1-local MQ model. Let $1x$ and $0x$ be the
two strings of length $n+1$ with suffix $x \in \{0,1\}^n$. Then for any $g_s \in \mathcal{G}_{n+1}$, $g_s(1x) \oplus g_s(0x) = s_i$ if
$x \in A_i$. Thus, drawing a random example from $U$ and making a 1-local query reveals one bit of
the string $s$. By drawing $O(n \log(n))$ random examples, all the bits of the string $s$ can be recovered
with high probability, thus revealing the function $g_s$ itself.

On the other hand, in the PAC model, the probability of seeing both examples $1x$ and $0x$ for some $x$ is
exponentially small. Thus, all the labels appear perfectly random (since $f_s$ is from a pseudorandom
family). Thus, no learning is possible in the PAC model. □
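
The learner in the first half of the proof can be simulated on a toy instance. The sketch below is not the construction of [GGM86]: it uses HMAC-SHA256 as a stand-in for the pseudorandom function $f_s$ (an assumption made purely for illustration), a small $n$, and hypothetical names such as `block_index` and `learn_secret`.

```python
import hashlib
import hmac
import random

N = 8  # number of suffix bits; examples have N + 1 bits

def f_s(s, suffix):
    """Stand-in for the pseudorandom function f_s: one bit of HMAC-SHA256 keyed by s."""
    return hmac.new(bytes(s), bytes(suffix), hashlib.sha256).digest()[0] & 1

def block_index(suffix):
    """0-based index of the block A_i containing the suffix, by lexicographic rank."""
    rank = int("".join(map(str, suffix)), 2)
    return rank * N // 2 ** N

def g_s(s, x):
    first, suffix = x[0], x[1:]
    label = f_s(s, suffix)
    # When the first bit is 1, the label is masked with the secret bit of the block.
    return label ^ s[block_index(suffix)] if first == 1 else label

def learn_secret(oracle, draw_example, num_examples):
    """Recover s using random examples plus one 1-local query per example."""
    recovered = [None] * N
    for _ in range(num_examples):
        x = draw_example()
        label = oracle(x)                      # label that accompanies the random example
        neighbour = (1 - x[0],) + x[1:]        # 1-local query: flip only the first bit
        recovered[block_index(x[1:])] = label ^ oracle(neighbour)
    return recovered

rng = random.Random(0)
secret = [rng.randint(0, 1) for _ in range(N)]

def draw_example():
    return tuple(rng.randint(0, 1) for _ in range(N + 1))

recovered = learn_secret(lambda x: g_s(secret, x), draw_example, num_examples=200)
```

Each example reveals the secret bit of the block its suffix falls in, so the number of examples needed is governed by a coupon-collector argument over the $n$ blocks, matching the $O(n \log(n))$ count in the proof.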

In fact, the above construction also shows that the random walk learning model (see [BMOS05])
is more powerful than the PAC learning setting, assuming that one-way functions exist. Bshouty
et al. [BMOS05] had already shown that the random walk model is provably weaker than the full
MQ model assuming that one-way functions exist. In fact, essentially the same argument also shows
that full MQ is more powerful than PAC+$o(n)$-local MQ. The following simple concept class (which
is the same as that of Bshouty et al.) shows the necessary separation.

Let $e^i$ be the vector that has 1 in the $i$-th position, and 0s elsewhere. Again, let
$\mathcal{F}_n = \{f_s : \{0,1\}^n \to \{0,1\}\}_{s \in \{0,1\}^n}$ be the pseudorandom family of functions. Then define $\mathcal{G}'_n = \{g_s\}$ as
follows:

$$g_s(x) = \begin{cases} s_i & \text{if } x = e^i \\ f_s(x) & \text{otherwise} \end{cases}$$



Theorem 5.2. The concept class $\mathcal{G}'_n$ is learnable in the full MQ model, but not in the PAC+$o(n)$-local
MQ model under the uniform distribution.

Proof. It is easy to see that by making membership queries to the points $e^1, \ldots, e^n$, the string $s$
is revealed and hence also the function $g_s$. On the other hand, random points from the Boolean
cube have Hamming weight $\Omega(n)$, except with exponentially small probability. Thus, $o(n)$-local
MQs are of no use to query the points $e^i$. The labels for any point obtained from the distribution,
or using $o(n)$-local MQs, are essentially random. Hence, $\mathcal{G}'_n$ is not learnable in the PAC+$o(n)$-local
MQ model. □
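
Both halves of this argument can be illustrated on a toy instance. As before, this is a sketch under assumptions: HMAC-SHA256 stands in for the pseudorandom family, $n = 64$ and the sample size are arbitrary, and the helper names are hypothetical.

```python
import hashlib
import hmac
import random

N = 64

def f_s(s, x):
    """Stand-in for the pseudorandom function f_s: one bit of HMAC-SHA256 keyed by s."""
    return hmac.new(bytes(s), bytes(x), hashlib.sha256).digest()[0] & 1

def unit_vector(i):
    return tuple(1 if j == i else 0 for j in range(N))

def g_s(s, x):
    if sum(x) == 1:                  # x = e^i for the unique i with x_i = 1
        return s[x.index(1)]
    return f_s(s, x)

def distance_to_nearest_unit_vector(x):
    """Hamming distance from x to the closest e^i."""
    weight = sum(x)
    # Keep one of the 1s of x and clear the rest; the all-zeros point is at distance 1.
    return weight - 1 if weight >= 1 else 1

rng = random.Random(0)
secret = [rng.randint(0, 1) for _ in range(N)]

# Full MQ learner: query e^1, ..., e^n and read off s directly.
recovered = [g_s(secret, unit_vector(i)) for i in range(N)]

# Uniformly random examples are far from every e^i, so local queries cannot reach them.
samples = [tuple(rng.randint(0, 1) for _ in range(N)) for _ in range(1000)]
min_distance = min(distance_to_nearest_unit_vector(x) for x in samples)
```

The value of `min_distance` gives a rough sense of the query radius a local learner would need before it could touch any of the points $e^i$.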



6 Conclusion and Future Work 

We introduced the local membership query model, with the goal of studying query algorithms 
that may be useful in practice. With the rise of crowdsourcing tools, it is increasingly possible to 
get human labellers for a variety of tasks. Thus, membership queries beyond the standard active 
learning paradigm could prove to be useful to increase the efficiency and accuracy of learning. In 
order to make use of human labellers, it is necessary to make queries that make sense to them. In 
some ways, our algorithms can be understood as searching for higher-dimensional (deeper) features 
using queries that modify the examples locally. 

Our model of local membership queries is also a very natural and simple theoretical model. 
There are several interesting open questions: (i) can the class of $t$-leaf decision trees (without depth
restriction) be learned under the class of smooth distributions? (ii) is the class of DNF formulas
learnable, at least under the uniform distribution? Another interesting question is whether a general
purpose boosting algorithm exists that only uses a-smooth distributions. This looks difficult since 
most boosting algorithms decrease weights of points substantially 7 . 

It is also interesting to see whether agnostic learning of any interesting concept classes is possible
in this learning model. We observe that learning the class of $O(\log(n))$-sized parities and the class
of decision trees are equivalent in the agnostic learning setting (even under smooth distributions),
since weak and strong agnostic learning are equivalent even with respect to a fixed distribution
[KK09, Fel10]. Agnostic learning of $O(\log(n))$-sized parities (even with respect to a fixed distribution)
would also imply (PAC) learning DNF in our model with local membership queries (with respect
to the same distribution) [KKM09].

A Learning under Random Classification Noise

The proofs of the following two lemmas are elementary and are omitted.

Lemma A.1. Suppose $X_0 = 0$. Consider the following random walk: $X_{i+1} = X_i$ with probability
$1 - a$, $X_{i+1} = X_i + 2$ with probability $a/2$, and $X_{i+1} = X_i - 2$ with probability $a/2$, where
$a \in [0, 1/2]$. Then, for $n > 0$, $\Pr[X_n = 0] - \Pr[X_n = 2]$ is a decreasing function of $a$.

The idea of the proof is to notice that the probability $\Pr[X_n = 2j]$ follows a bell-shaped curve,
and the curve gets steeper (more mass is concentrated at 0) as $a$ goes to 0.
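
The monotonicity claimed in Lemma A.1 is easy to observe numerically by computing the distribution of the walk exactly. The sketch below is only a numerical illustration, not a proof; the walk length 12 and the grid of values of $a$ are arbitrary choices, and the function names are hypothetical.

```python
def walk_distribution(n, a):
    """Exact distribution of X_n as {position: probability}, starting from X_0 = 0."""
    dist = {0: 1.0}
    for _ in range(n):
        nxt = {}
        for pos, p in dist.items():
            # Stay with probability 1 - a, move by +2 or -2 with probability a/2 each.
            for step, q in ((0, 1 - a), (2, a / 2), (-2, a / 2)):
                nxt[pos + step] = nxt.get(pos + step, 0.0) + p * q
        dist = nxt
    return dist

def gap(n, a):
    """Pr[X_n = 0] - Pr[X_n = 2] for the walk with parameter a."""
    dist = walk_distribution(n, a)
    return dist.get(0, 0.0) - dist.get(2, 0.0)

grid = [i / 20 for i in range(11)]        # a = 0, 0.05, ..., 0.5
gaps = [gap(12, a) for a in grid]
```

Printing `gaps` shows the difference starting at 1 for $a = 0$ and shrinking as $a$ grows towards $1/2$.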

Lemma A.2. Let $x_1, \ldots, x_{2n}$ be such that $x_1 = \cdots = x_{n+d} = 1$ and $x_{n+d+1} = x_{n+d+2} = \cdots = x_{2n} =
-1$. The sign of each $x_i$ is flipped independently with probability $\eta < 1/2$, to get $x'_i$. Let $p_d$ be the
probability that $\sum_i x'_i = 0$. Then for $d > 0$, as $d$ increases, $p_d$ decreases.

This expresses the quite obvious idea that if the probability of flipping is less than half, then
the further the initial sum ($\sum_i x_i$) is from 0, the less likely it is that $\sum_i x'_i = 0$.

Lemma A.3. Let $x_1, \ldots, x_{2n}$ be such that $x_1 = \cdots = x_{n+1} = 1$ and $x_{n+2} = \cdots = x_{2n} = -1$. The sign
of each $x_i$ is flipped independently with probability $\eta < 1/2$, to get $x'_i$. Let $p_1$ denote the probability
that $\sum_i x'_i = 0$. Let $y_1, \ldots, y_{2n}$ be such that $y_1 = \cdots = y_n = 1$ and $y_{n+1} = \cdots = y_{2n} = -1$. Then, let
$y'_i$ be obtained by flipping $y_i$ independently with probability $\eta < 1/2$, and let $p_0$ denote the probability
that $\sum_i y'_i = 0$. Then $p_0 - p_1 \ge (2\eta - 1)^2 c_0 / n^{3/2}$, for some absolute constant $c_0$.

Proof. First we leave aside the values $x'_n$, $x'_{n+1}$, $y'_n$ and $y'_{n+1}$. The remaining variables, both in
the case of the $x'_i$s and the $y'_i$s, were obtained by starting with exactly $(n-1)$ $+1$s and $(n-1)$ $-1$s and
flipping each independently with probability $\eta < 1/2$. We can form pairs of $(+1, -1)$, to get a
random variable $Z_i = x'_i + x'_{n+1+i}$, $i = 1, \ldots, n-1$, where $Z_i = 0$ with probability $\eta^2 + (1-\eta)^2
\ge 1/2$, $Z_i = +2$ with probability $\eta(1-\eta)$ and $Z_i = -2$ with probability $\eta(1-\eta)$. (A similar
argument can be made in the case of the $y'_i$s.) We can view the sum of these $Z_i$ random variables
as the random walk described in Lemma A.1, where $X_{i+1} = X_i$ with probability $\eta^2 + (1-\eta)^2$,
$X_{i+1} = X_i + 2$ with probability $\eta(1-\eta)$, and $X_{i+1} = X_i - 2$ with probability $\eta(1-\eta)$. Now,
$p_1 = \Pr[X_{n-1} = 0](2\eta(1-\eta)) + \Pr[X_{n-1} = 2]\eta^2 + \Pr[X_{n-1} = -2](1-\eta)^2$. On the other hand,
$p_0 = \Pr[X_{n-1} = 0](\eta^2 + (1-\eta)^2) + \Pr[X_{n-1} = 2]\eta(1-\eta) + \Pr[X_{n-1} = -2]\eta(1-\eta)$. Noticing that
$\Pr[X_{n-1} = 2] = \Pr[X_{n-1} = -2]$, we get that $p_0 - p_1 = (2\eta - 1)^2(\Pr[X_{n-1} = 0] - \Pr[X_{n-1} = 2])$.
But this difference is a decreasing function of $a = 1 - (\eta^2 + (1-\eta)^2)$. But, even when $a = 1/2$, i.e.
$\eta = 1/2$, this difference is given by,



$$\Pr[X_{n-1} = 0] - \Pr[X_{n-1} = 2] = \frac{1}{2^{2n-2}} \left( \binom{2n-2}{n-1} - \binom{2n-2}{n} \right).$$



The claim now follows easily. 



□ 
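
The identity $p_0 - p_1 = (2\eta - 1)^2(\Pr[X_{n-1} = 0] - \Pr[X_{n-1} = 2])$ used in the proof can be checked by exact computation. The following sketch assumes illustrative values $n = 10$ and $\eta = 0.3$; it also evaluates the gap at $a = 1/2$ as a difference of two binomial coefficients, using the fact that for $\eta = 1/2$ the walk is a sum of $2n - 2$ fair signs. All function names are hypothetical.

```python
from math import comb

def zero_sum_probability(num_plus, num_minus, eta):
    """Pr[sum of signs = 0] after flipping each sign independently with probability eta."""
    dist = {0: 1.0}
    for sign in [1] * num_plus + [-1] * num_minus:
        nxt = {}
        for total, p in dist.items():
            for value, q in ((sign, 1 - eta), (-sign, eta)):
                nxt[total + value] = nxt.get(total + value, 0.0) + p * q
        dist = nxt
    return dist.get(0, 0.0)

def walk_gap(steps, a):
    """Pr[X = 0] - Pr[X = 2] for the walk of Lemma A.1 after the given number of steps."""
    dist = {0: 1.0}
    for _ in range(steps):
        nxt = {}
        for pos, p in dist.items():
            for step, q in ((0, 1 - a), (2, a / 2), (-2, a / 2)):
                nxt[pos + step] = nxt.get(pos + step, 0.0) + p * q
        dist = nxt
    return dist.get(0, 0.0) - dist.get(2, 0.0)

n, eta = 10, 0.3                                   # illustrative values only
p0 = zero_sum_probability(n, n, eta)               # y: n plus-ones, n minus-ones
p1 = zero_sum_probability(n + 1, n - 1, eta)       # x: n+1 plus-ones, n-1 minus-ones
a = 2 * eta * (1 - eta)                            # a = 1 - (eta^2 + (1 - eta)^2)
identity_rhs = (2 * eta - 1) ** 2 * walk_gap(n - 1, a)

# Gap at a = 1/2, written as a difference of central binomial terms.
binomial_gap = (comb(2 * n - 2, n - 1) - comb(2 * n - 2, n)) / 2 ** (2 * n - 2)
```

Here `p0 - p1` and `identity_rhs` agree up to floating-point error, and `walk_gap(n - 1, 0.5)` matches `binomial_gap`, which behaves like a constant times $n^{-3/2}$.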